Statistical Inference I 
Shirsendu Mukherjee 
Department of Statistics, Asutosh College, Kolkata, India. 
shirsendu_st@yahoo.co.in 


Introduction to statistical inference 

In the present module we introduce the concept of point estimation, which is a part of statistical inference. Statistical inference is the process of using information gained from a sample to draw conclusions about the population from which the sample is taken. There are two aspects of statistical inference that we will study in this course: (i) estimation and (ii) hypothesis testing. In an estimation problem, some feature of the population in which an enquirer is interested may be completely unknown to him, and he may want to make a guess about this feature entirely on the basis of a random sample from the population. There are two types of estimation problem: (i) point estimation and (ii) interval estimation. In this lecture we shall discuss some preliminary concepts of point estimation. Let us start our discussion with a brief history of the estimation problem.


Historical Perspective

The problem of estimation arose in a very natural way in problems of astronomy and geodesy in the first half of the 18th century. In astronomy, for example, determining interplanetary distances and the positions of planets and their movements in time were some of the important problems, whereas in geodesy, determining the spheroidal shape of the earth was one of the most important problems. It is known that the figure of the earth is almost a sphere except for some flatness near the poles. Observations were obtained on the length of one degree of a certain meridian, and the problem was to determine the parameters, say $\alpha$ and $\beta$, which specified the spheroid of the earth. Indirect observations on $(\alpha, \beta)$ were given by the relation

$$Y_i = \alpha + \beta x_i, \quad i = 1, 2, \ldots, n,$$

where the $x_i$'s are known fixed constants. Note that $(\alpha, \beta)$ are uniquely determined if only two observations on $Y$ at different values $(x_1, x_2)$ are available. However, as is customary in science, several observations were made at different values $(x_1, x_2, \ldots, x_n)$, and this led to the theory of combination of observations with random error which directly or indirectly measured "magnitudes of interest", or parameters. To estimate $\alpha$ and $\beta$ on the basis of the given data, the first attempt was made by Roger Boscovich (1757) in the course of a geodetic study of the ellipticity (extent of flatness at the poles) of the earth. He suggested that the estimates of $(\alpha, \beta)$ be determined such that

(i) the sum of positive and negative residuals or errors should balance, i.e. $\sum_{i}(y_i - \alpha - \beta x_i) = 0$, and

(ii) subject to the above constraint, $(\alpha, \beta)$ is chosen so that $R = \sum_{i}|y_i - \alpha - \beta x_i|$, the sum of absolute values of the errors $e_i = y_i - \alpha - \beta x_i$, is as small as possible.

Using a geometric argument, Boscovich solved the problem for the five observations that he had. Laplace (1789) gave a general algebraic algorithm to obtain estimates of $\alpha$ and $\beta$ on the above principles for any number of observations. This problem was later solved by Gauss and Legendre using the method of least squares.

Boscovich made the assumption that the errors of overestimation and underestimation must balance out. This idea was later used by many researchers. For estimating the parameter $\theta$ in the simplest model $Y_i = \theta + e_i$, Simpson (1776) used this idea by assuming that the errors are symmetrically and uniformly distributed about zero, i.e. that the probability density function of the error is $f(e) = \frac{1}{2h}$, $-h < e < h$, $h > 0$. Euler (1778) proposed the arc of a parabolic curve, $f(e) = \frac{3}{4r^3}(r^2 - e^2)$, $-r \le e \le r$, $r > 0$, as the p.d.f. of the random error. Laplace suggested the probability density function $f(e) = \frac{1}{2}\exp(-|e|)$, $-\infty < e < \infty$, as the model for the distribution of errors, and Gauss proposed the normal distribution with probability density function $f(e) = \frac{1}{\sqrt{2\pi}}\exp(-e^2/2)$, $-\infty < e < \infty$. It is important to point out here that the double exponential distribution used by Laplace to represent the error distribution led to the median of the sample as the "best" estimator of the "true value" of the parameter $\theta$, whereas the normal distribution used by Gauss led to the mean of the sample as the "best" estimator of the "true value".

Theory of Point estimation 

Background 

We consider a random experiment E. The outcome of E is represented by an observable random vector $X = (X_1, X_2, \ldots, X_n)$, $n \ge 1$. A particular value of $X$ is denoted by $x = (x_1, x_2, \ldots, x_n)$. The character $X$ could be real or vector valued, and the set of all values of $X$ is called the sample space and is denoted by $\mathcal{X} \subseteq \mathbb{R}^n$.

The random vector $X$ is generated by $F(x) = P(X \le x)$, $x \in \mathcal{X}$, the distribution function of $X$.

In a parametric point estimation problem we assume that the functional form of $F(x)$ is known except perhaps for a certain number of parameters. Let $\theta = (\theta_1, \theta_2, \ldots, \theta_k)$ be the unknown parameters associated with $F(x)$. The parameter $\theta$ may be real valued or vector valued and is usually called a labelling or indexing parameter. It varies over a set of values called the parameter space, denoted by $\Theta \subseteq \mathbb{R}^k$. So $F(x)$ can be looked upon as a function of $\theta$, and henceforth we will write it as $F_\theta(x)$.

If $X$ is discrete or absolutely continuous, then $F_\theta(x)$ is generated by $f_\theta(x)$, the probability mass function (p.m.f.) or probability density function (p.d.f.) of $X$. We write $\mathcal{F}_\theta = \{p(x, \theta) : \theta \in \Theta\}$ for the class of all such probability mass or density functions. The object of inference is the parameter $\theta$, or a function $g(\theta)$ of the parameter, which is of interest. Let us consider a few examples.
Example 1 Suppose a coin is tossed 50 times. The outcome of the ith toss can be described by a random variable $X_i$ such that $X_i = 1$ or $0$ according as the ith toss results in a head or a tail. The sample space is

$$\mathcal{X} = \{(x_1, x_2, \ldots, x_{50}) : x_i = 0 \text{ or } 1 \text{ for all } i\}.$$

If $\theta$ is the probability of getting a head in any toss, then $\Theta = (0, 1)$ and the probability function of $X$ is

$$p(x, \theta) = \prod_{i=1}^{50} \theta^{x_i}(1 - \theta)^{1 - x_i}, \quad x \in \mathcal{X},\ \theta \in \Theta.$$

We may want to estimate $\theta$ or any function of $\theta$.


Example 2 Suppose that 100 seeds of a certain flower were planted, one in each pot, and let $X_i$ equal one or zero according as the seed in the ith pot germinates or not. The data consist of $(x_1, x_2, \ldots, x_{100})$, a sequence of ones and zeroes, and are regarded as a realization of $(X_1, X_2, \ldots, X_{100})$ whose components are i.i.d. random variables with $P[X_i = 1] = \theta$ and $P[X_i = 0] = 1 - \theta$, where $\theta$ represents the probability that a seed germinates. The object of estimation is $\theta$ itself or a function $g(\theta)$ that may be of interest. For example, consider $g(\theta) = \binom{10}{8}\theta^8(1 - \theta)^2$, which is the probability that in a batch of 10 seeds exactly 8 seeds will germinate.


Example 3 In a pathbreaking experiment, Rutherford, Chadwick and Ellis (1920) observed 2608 time intervals of 7.5 seconds each and counted the number of time intervals $N_r$ in which exactly $r$ $\alpha$-particles hit the counter. They obtained the following table.

r       0     1     2     3     4     5     6     7     8     9   >=10
N_r    57   203   383   525   532   408   273   139    45    27     16

It is well known that the Poisson distribution with p.m.f. $f_\theta(x) = \exp(-\theta)\,\theta^x / x!$, $x = 0, 1, 2, \ldots$, $\theta > 0$, serves as a good model for the number of times a given event E occurs in a unit time interval. If $X_i$ denotes the number of $\alpha$-particles hitting the counter in the i-th time interval, then $(X_1, X_2, \ldots, X_n)$, where $n = 2608$, are i.i.d. Poisson random variables with parameter $\theta$. We may want to estimate $\theta$ on the basis of the given data.


Example 4 Consider the determination of an ideal physical constant such as the acceleration due to gravity $g$. The usual way to estimate $g$ is by the pendulum experiment, observing $X = 4\pi^2 l / T^2$, where $l$ is the length of the pendulum and $T$ is the period of oscillation, obtained from the time required for a fixed number of oscillations. Owing to variation which depends on several factors, such as the skill of the experimenter and measurement errors, the i-th observation is $X_i = g + e_i$, where $e_i$ is the random error. Assuming that the distribution of the error is normal with zero mean and variance $\sigma^2$, we have that $X_1, X_2, \ldots, X_n$ are i.i.d. $N(g, \sigma^2)$. Here the parameter $\theta$ is a two-dimensional vector, $\theta = (g, \sigma^2)$. The primary object is the estimation of $g$. On the other hand, one may be interested in estimating the error variance $\sigma^2$, through which we can assess the ability of the experimenter.


Example 5 Suppose an experiment is conducted by measuring the lifetimes, in hours, of $n$ electric bulbs produced by a certain company. Let $X_i$ be the lifetime of the ith bulb. The sample space is

$$\mathcal{X} = \{(x_1, x_2, \ldots, x_n) : x_i \ge 0 \text{ for all } i\}.$$

If we assume that the distribution of each $X_i$ is exponential with mean $\theta$, then $\Theta = (0, \infty)$ and the probability function of $X$ is

$$p(x, \theta) = \theta^{-n}\exp\Big(-\frac{1}{\theta}\sum_{i=1}^{n} x_i\Big), \quad x \in \mathcal{X},\ \theta \in \Theta.$$

We may want to estimate the parameter $\theta$ or $g(\theta) = e^{-60/\theta}$, which represents the probability that the lifetime of a bulb will be at least 60 hours.


Objective 

The distribution of $X$ is characterized by the unknown parameter $\theta$, about which we only know that it belongs to the parameter space $\Theta$. To discuss the problem of point estimation, for the sake of simplicity, we consider the case when the parameter of interest is a real-valued function $g = g(\theta)$ of $\theta$. In point estimation we try to approximate $g(\theta)$ on the basis of the observed value $x$ of $X$. In other words, we try to put forward a particular statistic, i.e. a function of $X$, say $T = T(X)$, which would represent the unknown $g(\theta)$ very closely. Such a statistic $T$ is called an estimator or a point estimator of $g(\theta)$. Mathematically, $T$ is a measurable mapping from $\mathcal{X}$ to the space of values of $g(\theta)$, and such an estimator is called admissible. Any observed value of $T$ is called an estimate of $g(\theta)$. In a nutshell, a point estimate of a parameter $\theta$ is a single number that can be regarded as a sensible value for $\theta$. A point estimate is obtained by selecting a suitable statistic and computing its value from the given sample data. The selected statistic is called a point estimator of $\theta$. Note that, for a particular estimator $T$ of a parameter $\theta$, the estimate of $\theta$ may vary from sample to sample.


Example Suppose we want to estimate $\theta$ in Example 1. We may use the statistic $T = \frac{1}{50}\sum_{i=1}^{50} X_i$ as an estimator of $\theta$. Here $T$ is a mapping from $\mathcal{X}$ to $(0, 1)$ and it is admissible.
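The distinction between an estimator (a rule) and an estimate (a number obtained from one sample) can be illustrated with a small simulation of the coin-tossing example. This is only a sketch: the "true" value `theta = 0.3`, the seed and the function names are assumptions made for the illustration and do not come from the example itself.

```python
import random

def sample_proportion(tosses):
    """Estimator T(X) = (1/n) * sum of X_i for 0/1 coin-toss outcomes."""
    return sum(tosses) / len(tosses)

def toss_coin(n, theta, rng):
    """Simulate n independent tosses; 1 = head (probability theta), 0 = tail."""
    return [1 if rng.random() < theta else 0 for _ in range(n)]

rng = random.Random(0)   # fixed seed, assumed only for reproducibility
theta = 0.3              # hypothetical "true" value; unknown in practice
# Three independent samples of 50 tosses give three (usually different) estimates.
estimates = [sample_proportion(toss_coin(50, theta, rng)) for _ in range(3)]
print(estimates)
```

The same function `sample_proportion` is applied each time; only the observed sample, and hence the estimate, changes.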




In any given problem of estimation we may have a large, often infinite, class of competing estimators for $g(\theta)$, a real-valued function of the parameter $\theta$. The question that then arises is: are some of the many possible estimators better, in some sense, than others? In this section we define certain criteria, which an estimator may or may not possess, that will help us compare the performance of rival estimators and decide which one is perhaps the "best".


Closeness 

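If our object is to estimate a parametric function $g(\theta)$, then we would like the estimator $T(X)$ to be close to $g(\theta)$. Since $T(X)$ is a statistic, the usual measure of closeness $|T(X) - g(\theta)|$ is also a random variable, and as a measure of closeness of $T$ we use $P_\theta(|T - g| < \epsilon)$ for some $\epsilon > 0$.

Consider two estimators $T_1$ and $T_2$ for estimating a parametric function $g = g(\theta)$ of $\theta$. The estimator $T_1$ is called a more concentrated estimator of $g(\theta)$ than $T_2$ if, for every $\epsilon > 0$,

$$P_\theta(|T_1 - g| < \epsilon) \ge P_\theta(|T_2 - g| < \epsilon) \quad \text{for all } \theta \in \Theta. \qquad (1)$$

Result: A necessary condition for (1) to hold is that

$$E_\theta(T_1 - g)^2 \le E_\theta(T_2 - g)^2 \quad \text{for all } \theta \in \Theta,$$

provided $E_\theta(T_i - g)^2$ exists for $i = 1, 2$.

Proof (for the continuous case): We know that for every non-negative random variable $X$ such that $E(X)$ exists,

$$E(X) = \int_0^\infty P(X > x)\,dx.$$

Since $(T_i - g)^2$ is a non-negative random variable, we get

$$E_\theta(T_i - g)^2 = \int_0^\infty P_\theta\big((T_i - g)^2 > t\big)\,dt = \int_0^\infty P_\theta\big(|T_i - g| > \sqrt{t}\big)\,dt.$$

It follows that

$$E_\theta(T_1 - g)^2 - E_\theta(T_2 - g)^2 = \int_0^\infty \Big[P_\theta\big(|T_2 - g| \le \sqrt{t}\big) - P_\theta\big(|T_1 - g| \le \sqrt{t}\big)\Big]\,dt.$$

If (1) holds for every $\epsilon > 0$, the integrand is non-positive for every $t > 0$ (in the continuous case $P_\theta(|T_i - g| \le \epsilon) = P_\theta(|T_i - g| < \epsilon)$). Hence $E_\theta(T_1 - g)^2 \le E_\theta(T_2 - g)^2$ for all $\theta \in \Theta$.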


Mean-squared Error(MSE) 
If $T$ is an estimator of $g$, then the MSE of $T$ is defined by

$$\operatorname{MSE}_\theta(T) = E_\theta(T - g)^2, \quad \text{for all } \theta \in \Theta.$$

The term $(T - g)$ is called the error of $T$ in estimating $g$, and thus $E_\theta(T - g)^2$ is called the mean squared error of $T$. It measures the average squared difference between the estimator $T$ and the parameter $g$. From the above result it is clear that the smaller the MSE, the better the estimator. Naturally, we would prefer an estimator with the smallest MSE. If such an estimator exists, it will be best for the parameter $g$.

An estimator $T$ is said to be best for $g$ if $\operatorname{MSE}_\theta(T) \le \operatorname{MSE}_\theta(T')$ for all $\theta \in \Theta$, for any other estimator $T'$ of $g$. But the problem is that no best estimator exists in this sense, as the following discussion shows.

Suppose $T$ is best. For a particular value of $\theta$, say $\theta_0$, let $T'$ be defined as

$$T'(x) = g(\theta_0) \quad \text{for all } x \in \mathcal{X}.$$

Then

$$\operatorname{MSE}_{\theta_0}(T') = E_{\theta_0}[g(\theta_0) - g(\theta_0)]^2 = 0 \;\Rightarrow\; \operatorname{MSE}_{\theta_0}(T) = 0.$$

Hence $T = g(\theta_0)$ with probability 1 under $\theta_0$. Since $\theta_0$ is arbitrary, for any $\theta$,

$$T = g(\theta) \quad \text{with probability 1}.$$

But $T$, being a statistic, cannot be a function of the unknown $\theta$. Hence such a best estimator does not exist.

Consider the following example. 

Example Let $X_1, X_2, \ldots, X_n$ be i.i.d. $N(\theta, 1)$, $\theta \in \mathbb{R}$, random variables. To estimate $\theta$ let us consider two estimators, $T = \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $T' = \theta_0$, where $\theta_0$ is a fixed constant.

Then we get

$$\operatorname{MSE}_\theta(T) = \frac{1}{n} \quad \text{and} \quad \operatorname{MSE}_\theta(T') = (\theta_0 - \theta)^2.$$

Now for values of $\theta \in \big[\theta_0 - \frac{1}{\sqrt{n}},\ \theta_0 + \frac{1}{\sqrt{n}}\big]$ we have $\operatorname{MSE}_\theta(T') \le \operatorname{MSE}_\theta(T)$, and for other values of $\theta$ we have $\operatorname{MSE}_\theta(T') > \operatorname{MSE}_\theta(T)$.

Here $T'$ is not a good estimator of $\theta$, since it always estimates $\theta$ to be $\theta_0$ and does not depend on the observations at all. On the other hand, the estimator $T$ utilizes the observations and is therefore better than $T'$ in general.

From the above discussion it is clear that if MSE is the only criterion in the search for a good estimator, then there may be some "freak" estimators (like $T'$) that are extremely prejudiced in favour of particular values of $\theta$ and that perform better than a generally good estimator at those points. For instance, in the above example the estimator $T'$ is highly partial to $\theta_0$, since it always estimates $\theta$ to be $\theta_0$. One could rule out such freak estimators by considering only estimators that satisfy some other property. One such property is that of unbiasedness.
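The comparison between the sample mean and the constant estimator can be tabulated directly from the two closed-form MSE expressions of the example. The sketch below is illustrative only; the values `n = 25` and `theta0 = 2.0` and the function names are assumptions chosen for the demonstration.

```python
import math

def mse_sample_mean(n):
    """MSE of the mean of n i.i.d. N(theta, 1) observations: 1/n for every theta."""
    return 1.0 / n

def mse_constant(theta, theta0):
    """MSE of the constant estimator T' = theta0: (theta0 - theta)^2."""
    return (theta0 - theta) ** 2

n, theta0 = 25, 2.0              # illustrative values (assumptions)
half_width = 1 / math.sqrt(n)    # T' is at least as good only when |theta - theta0| <= 1/sqrt(n)
for theta in (2.0, 2.1, 2.5, 4.0):
    winner = "T'" if mse_constant(theta, theta0) <= mse_sample_mean(n) else "T"
    print(theta, mse_sample_mean(n), mse_constant(theta, theta0), winner)
```

Outside the narrow interval of half-width `half_width` around `theta0`, the constant estimator loses, and its MSE grows without bound as `theta` moves away from `theta0`.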


Unbiasedness 


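Definition An estimator $T$ is said to be an unbiased estimator (UE) of $g(\theta)$ if

$$E_\theta(T) = g(\theta) \quad \text{for all } \theta \in \Theta.$$

If $T$ is not an unbiased estimator of $g(\theta)$, then the bias of $T$ is defined by

$$B_\theta(T) = E_\theta(T) - g(\theta), \quad \theta \in \Theta.$$

The following result shows a relationship between the MSE and the variance of an estimator in terms of the bias.

Result $\operatorname{MSE}_\theta(T) = \operatorname{Var}_\theta(T) + B_\theta^2(T)$, $\theta \in \Theta$.

Proof

$$\begin{aligned}
\operatorname{MSE}_\theta(T) &= E_\theta\big(T - g(\theta)\big)^2 \\
&= E_\theta\big[(T - E_\theta(T)) + (E_\theta(T) - g(\theta))\big]^2 \\
&= E_\theta\big(T - E_\theta(T)\big)^2 + \big(E_\theta(T) - g(\theta)\big)^2 + 2\big(E_\theta(T) - g(\theta)\big)E_\theta\big(T - E_\theta(T)\big) \\
&= \operatorname{Var}_\theta(T) + B_\theta^2(T) + 2B_\theta(T)\big(E_\theta(T) - E_\theta(T)\big) \\
&= \operatorname{Var}_\theta(T) + B_\theta^2(T).
\end{aligned}$$

Thus, MSE incorporates two components, one measuring the variability of the estimator (precision) and the other measuring its bias (accuracy). An estimator that has good MSE properties has small combined variance and bias. To find an estimator with good MSE properties, we need to find estimators that control both variance and bias. Clearly, unbiased estimators do a good job of controlling bias. For an unbiased estimator $T$ we have $\operatorname{MSE}_\theta(T) = \operatorname{Var}_\theta(T)$.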

Although many unbiased estimators are reasonable from the standpoint of MSE, controlling bias does not guarantee that MSE is controlled. In some cases a trade-off occurs between the variance and the bias, such that a small increase in bias can be traded for a larger decrease in variance, resulting in a smaller MSE. This is clear from the following examples.

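Example 1 Let $X_i \sim N(\mu, \sigma^2)$, $i = 1, 2, \ldots, n$, independently, where $\mu$ and $\sigma^2$ are unknown.

Consider all estimators of $\sigma^2$ of the form $T = cS^2$, where $c > 0$ is a constant and

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2.$$

Now

$$\operatorname{MSE}_\sigma(cS^2) = E_\sigma(cS^2 - \sigma^2)^2 = c^2E_\sigma(S^4) - 2c\sigma^2E_\sigma(S^2) + \sigma^4.$$

Since $\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$,

$$E_\sigma\Big(\frac{(n-1)S^2}{\sigma^2}\Big) = n - 1, \qquad V_\sigma\Big(\frac{(n-1)S^2}{\sigma^2}\Big) = 2(n-1),$$

which gives $E_\sigma(S^2) = \sigma^2$ and $V_\sigma(S^2) = \frac{2\sigma^4}{n-1}$, so that $E_\sigma(S^4) = V_\sigma(S^2) + \sigma^4 = \frac{n+1}{n-1}\sigma^4$. After some routine algebra we get

$$\operatorname{MSE}_\sigma(cS^2) = \sigma^4\Big[\frac{n+1}{n-1}c^2 - 2c + 1\Big],$$

which attains its minimum when $c = \frac{n-1}{n+1}$, the minimum value being

$$\frac{2\sigma^4}{n+1} < \frac{2\sigma^4}{n-1}.$$

Hence $T = \frac{1}{n+1}\sum_{i=1}^{n}(X_i - \bar{X})^2$ has smaller MSE than the unbiased estimator $S^2$ of $\sigma^2$. But $T$ is not unbiased for $\sigma^2$, since $E_\sigma(T) = \frac{n-1}{n+1}\sigma^2$. If we use the criterion of MSE, the estimator $T$, which is biased for $\sigma^2$, is better than the unbiased estimator $S^2$ of $\sigma^2$.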
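The closed-form MSE of $cS^2$ can be evaluated numerically to compare the unbiased choice $c = 1$ with the minimising choice $c = (n-1)/(n+1)$. The following is a minimal sketch; the sample size `n = 10`, the default `sigma2 = 1.0` and the function name are assumptions made for illustration.

```python
def mse_scaled_s2(c, n, sigma2=1.0):
    """MSE of c*S^2 for a normal sample: sigma^4 * (c^2 (n+1)/(n-1) - 2c + 1)."""
    return sigma2 ** 2 * (c * c * (n + 1) / (n - 1) - 2 * c + 1)

n = 10
c_unbiased = 1.0              # T = S^2
c_best = (n - 1) / (n + 1)    # minimiser of the quadratic in c
print(mse_scaled_s2(c_unbiased, n))   # equals 2/(n-1) when sigma^2 = 1
print(mse_scaled_s2(c_best, n))       # equals 2/(n+1) when sigma^2 = 1
```

Scanning a grid of values of `c` around `c_best` with the same function shows that no other multiplier does better, in line with the quadratic form of the MSE.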


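Example 2 Let $X_i \sim N(\theta, 1)$, $i = 1, 2, \ldots, n$, independently, where $|\theta| < 1$. Here $\bar{X}$ is unbiased for $\theta$. Let us consider the following estimator $T$ of $\theta$:

$$T = \begin{cases} -1 & \text{if } \bar{X} < -1, \\ \bar{X} & \text{if } |\bar{X}| \le 1, \\ 1 & \text{if } \bar{X} > 1. \end{cases}$$

Then

$$E_\theta(T) = P_\theta(\bar{X} > 1) - P_\theta(\bar{X} < -1) + \int_{-1}^{1} \bar{x} f(\bar{x})\,d\bar{x},$$

where $f(\bar{x})$ represents the p.d.f. of $\bar{X}$. Hence $T$ is a biased estimator of $\theta$, and the MSE of $T$ is given by

$$\operatorname{MSE}_\theta(T) = (1 - \theta)^2 P_\theta(\bar{X} > 1) + (-1 - \theta)^2 P_\theta(\bar{X} < -1) + \int_{-1}^{1} (\bar{x} - \theta)^2 f(\bar{x})\,d\bar{x}.$$

For the estimator $\bar{X}$ we have

$$E_\theta(\bar{X}) = \theta.$$

Hence $\bar{X}$ is an unbiased estimator of $\theta$ and

$$\operatorname{MSE}_\theta(\bar{X}) = \int_{-\infty}^{\infty} (\bar{x} - \theta)^2 f(\bar{x})\,d\bar{x}.$$

Now

$$\begin{aligned}
\operatorname{MSE}_\theta(\bar{X}) - \operatorname{MSE}_\theta(T) &= \int_{-\infty}^{\infty} (\bar{x} - \theta)^2 f(\bar{x})\,d\bar{x} - \operatorname{MSE}_\theta(T) \\
&= \int_{-\infty}^{-1} \big[(\bar{x} - \theta)^2 - (-1 - \theta)^2\big] f(\bar{x})\,d\bar{x} + \int_{1}^{\infty} \big[(\bar{x} - \theta)^2 - (1 - \theta)^2\big] f(\bar{x})\,d\bar{x} \\
&= I_1 + I_2, \text{ say.}
\end{aligned}$$

In the first integral $I_1$,

$$\bar{x} < -1 \Rightarrow \bar{x} - \theta < -1 - \theta < 0 \Rightarrow (\bar{x} - \theta)^2 > (-1 - \theta)^2.$$

In the second integral $I_2$,

$$\bar{x} > 1 \Rightarrow \bar{x} - \theta > 1 - \theta > 0 \Rightarrow (\bar{x} - \theta)^2 > (1 - \theta)^2.$$

Hence we get $I_1 + I_2 > 0$. Thus $T$ is a biased estimator of $\theta$, but the MSE of $T$ is less than that of $\bar{X}$.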
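The inequality between the two mean squared errors can also be observed in a Monte Carlo experiment. The sketch below is an illustration under assumed settings (`theta = 0.8`, `n = 4`, 5000 replications, fixed seed); the function names are hypothetical and not part of the example.

```python
import random

def clip_to_unit(x):
    """T = -1 if x < -1, x if |x| <= 1, 1 if x > 1."""
    return max(-1.0, min(1.0, x))

def simulate_mse(theta, n, reps, seed=0):
    """Monte Carlo MSE of the sample mean and of its truncated version T."""
    rng = random.Random(seed)
    se_mean = se_trunc = 0.0
    for _ in range(reps):
        xbar = sum(rng.gauss(theta, 1.0) for _ in range(n)) / n
        se_mean += (xbar - theta) ** 2
        se_trunc += (clip_to_unit(xbar) - theta) ** 2
    return se_mean / reps, se_trunc / reps

mse_mean, mse_trunc = simulate_mse(theta=0.8, n=4, reps=5000)
print(mse_mean, mse_trunc)
```

Because truncation to $[-1, 1]$ never moves the sample mean further from a value of $\theta$ inside $(-1, 1)$, the simulated MSE of the truncated estimator cannot exceed that of the sample mean in any replication.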

Note In both examples a natural question arises: which estimator should then be preferred? The answer depends on the purpose for which the estimate is obtained.




Estimable Function: A parametric function $g(\theta)$ is called estimable if there exists at least one unbiased estimator of $g(\theta)$.
Result: If $X \sim \text{Bin}(n, \theta)$, the estimable functions of $\theta$ are polynomial functions of $\theta$ of degree $n$ or less.
Proof: Let $g(\theta)$ be an estimable function of $\theta$. Then there exists a statistic $T(X)$ such that

$$E_\theta(T(X)) = g(\theta) \quad \text{for all } \theta \in (0, 1)$$

$$\Rightarrow \sum_{x=0}^{n} T(x)\binom{n}{x}\theta^x(1 - \theta)^{n-x} = g(\theta) \quad \text{for all } \theta \in (0, 1). \qquad (2)$$

Since the L.H.S. of (2) is a polynomial in $\theta$ of degree $n$ or less, for the identity to hold the R.H.S. must also be a polynomial in $\theta$ of degree $n$ or less.

Note: If $X \sim \text{Poisson}(\theta)$, the estimable functions of $\theta$ are convergent power series in $\theta$.

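Example 1: Consider a random variable $X$ which takes only the two values 1 and 2, with respective probabilities $\theta^2 + \theta^3$ and $1 - \theta^2 - \theta^3$. Here $\theta$ is an unknown parameter whose only possible values are 1/3 and 2/3. On the basis of a single observation $X$, we want to see whether the function $\frac{1}{\theta}$ is estimable or not.

If possible, suppose $\frac{1}{\theta}$ is estimable. Then there exists a statistic $T(X)$ which is an unbiased estimator of $\frac{1}{\theta}$. Hence

$$E_\theta(T) = \frac{1}{\theta} \quad \text{for } \theta = 1/3,\ 2/3$$

$$\Rightarrow T(1)(\theta^2 + \theta^3) + T(2)(1 - \theta^2 - \theta^3) = \frac{1}{\theta} \quad \text{for } \theta = 1/3,\ 2/3.$$

We get

$$\frac{4}{27}T(1) + \frac{23}{27}T(2) = 3,$$
$$\frac{20}{27}T(1) + \frac{7}{27}T(2) = \frac{3}{2}.$$

The two equations are satisfied when $T(1) = \frac{27}{32}$ and $T(2) = \frac{27}{8}$. Hence there exists an unbiased estimator of $\frac{1}{\theta}$, that is, $\frac{1}{\theta}$ is estimable.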
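The two unbiasedness equations form a $2 \times 2$ linear system, which can be solved exactly with rational arithmetic. The sketch below uses Cramer's rule; the variable and function names are assumptions introduced for the illustration.

```python
from fractions import Fraction as F

def prob_of_one(theta):
    """P(X = 1) = theta^2 + theta^3."""
    return theta ** 2 + theta ** 3

thetas = [F(1, 3), F(2, 3)]
# Unbiasedness for 1/theta: T1*p + T2*(1 - p) = 1/theta at both parameter values.
p1, p2 = (prob_of_one(t) for t in thetas)
g1, g2 = (1 / t for t in thetas)
# Solve the 2x2 linear system by Cramer's rule.
det = p1 * (1 - p2) - p2 * (1 - p1)
T1 = (g1 * (1 - p2) - g2 * (1 - p1)) / det
T2 = (p1 * g2 - p2 * g1) / det
print(T1, T2)
```

Substituting `T1` and `T2` back into either equation reproduces the corresponding value of $1/\theta$ exactly, since `Fraction` avoids rounding error.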

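Example 2: Let $X_i \sim N(\mu, \sigma^2)$, $i = 1, 2, \ldots, n$, independently, where $\mu$ and $\sigma^2$ are unknown.

Let $\theta = (\mu, \sigma^2)$ and $g(\theta) = e^{-\mu + \frac{1}{2}\sigma^2}$.

From the m.g.f. of the normal distribution,

$$E_\theta(e^{tX_i}) = e^{t\mu + \frac{1}{2}t^2\sigma^2} \quad \text{for all } \theta \text{ and } t.$$

Setting $t = -1$ we get $E_\theta(e^{-X_i}) = e^{-\mu + \frac{1}{2}\sigma^2}$ for all $\theta$, and hence

$$E_\theta\Big(\frac{1}{n}\sum_{i=1}^{n} e^{-X_i}\Big) = e^{-\mu + \frac{1}{2}\sigma^2} \quad \text{for all } \theta.$$

Thus there exists an unbiased estimator of $g(\theta)$, and hence $g(\theta)$ is estimable.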

Best linear Unbiased Estimator (BLUE) 

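Let $T_1, T_2, \ldots, T_k$ be $k$ unbiased estimators of $g(\theta)$. Consider a linear combination of $T_1, T_2, \ldots, T_k$, namely $T = \sum_{i=1}^{k} l_iT_i$, where $l_1, l_2, \ldots, l_k$ are real numbers. For $T$ to be unbiased we must have

$$E_\theta(T) = g(\theta) \quad \text{for all } \theta$$
$$\Rightarrow \sum_{i=1}^{k} l_iE_\theta(T_i) = g(\theta) \quad \text{for all } \theta$$
$$\Rightarrow g(\theta)\sum_{i=1}^{k} l_i = g(\theta) \quad \text{for all } \theta$$
$$\Rightarrow \sum_{i=1}^{k} l_i = 1, \text{ by equating the coefficients of } g(\theta).$$

Such an unbiased estimator $T$ is called a linear unbiased estimator of $g(\theta)$. The estimator $T$ is called the BLUE of $g(\theta)$ if it has the minimum variance among all linear unbiased estimators of $g(\theta)$.

Determination of BLUE: For the sake of simplicity we assume that the $T_i$'s are independent. Let $\operatorname{Var}_\theta(T_i) = \sigma_i^2$ for all $i = 1, 2, \ldots, k$. Then the variance of $T$ is given by

$$\operatorname{Var}_\theta(T) = \sum_{i=1}^{k} l_i^2\sigma_i^2.$$

Our problem is to minimize $\sum_{i=1}^{k} l_i^2\sigma_i^2$ subject to the constraint $\sum_{i=1}^{k} l_i = 1$.

By the Cauchy-Schwarz inequality, for any two sets of real numbers $a_1, a_2, \ldots, a_k$ and $b_1, b_2, \ldots, b_k$,

$$\Big(\sum_{i=1}^{k} a_i^2\Big)\Big(\sum_{i=1}^{k} b_i^2\Big) \ge \Big(\sum_{i=1}^{k} a_ib_i\Big)^2,$$

where the equality sign holds if $a_i = cb_i$ for all $i = 1, 2, \ldots, k$, where $c$ is a constant.

Setting $a_i = l_i\sigma_i$ and $b_i = \frac{1}{\sigma_i}$ for all $i = 1, 2, \ldots, k$ we get

$$\Big(\sum_{i=1}^{k} l_i^2\sigma_i^2\Big)\Big(\sum_{i=1}^{k} \frac{1}{\sigma_i^2}\Big) \ge \Big(\sum_{i=1}^{k} l_i\Big)^2 = 1$$

$$\Rightarrow \sum_{i=1}^{k} l_i^2\sigma_i^2 \ge \frac{1}{\sum_{i=1}^{k} \frac{1}{\sigma_i^2}}.$$

The minimum value of $\operatorname{Var}_\theta(T)$ is $\frac{1}{\sum_{i=1}^{k} 1/\sigma_i^2}$, and the minimum is attained if $l_i\sigma_i^2 = c$ for all $i$, implying that $l_i = \frac{c}{\sigma_i^2}$ for all $i$. Using the constraint $\sum_{i=1}^{k} l_i = 1$ we get $c = \frac{1}{\sum_{i=1}^{k} 1/\sigma_i^2}$, and hence $l_i = \frac{w_i}{\sum_{i=1}^{k} w_i}$, where $w_i = \frac{1}{\sigma_i^2}$ for all $i$. The BLUE for $g(\theta)$ is given by

$$T = \frac{\sum_{i=1}^{k} w_iT_i}{\sum_{i=1}^{k} w_i},$$

which is the weighted mean of the $T_i$'s with the $w_i$'s as weights.

Note: If $\operatorname{Var}_\theta(T_i) = \sigma^2$ for all $i = 1, 2, \ldots, k$, then $T = \frac{1}{k}\sum_{i=1}^{k} T_i$, which is the simple arithmetic mean of the $T_i$'s.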
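The weights derived above amount to inverse-variance weighting, which is straightforward to compute. The following is a small sketch for independent unbiased estimators with known variances; the function names and the numerical inputs are hypothetical and serve only as an illustration.

```python
def blue_weights(variances):
    """Weights l_i = w_i / sum(w), with w_i = 1 / sigma_i^2."""
    w = [1.0 / v for v in variances]
    total = sum(w)
    return [wi / total for wi in w]

def blue(estimates, variances):
    """Inverse-variance weighted mean of independent unbiased estimates."""
    return sum(l * t for l, t in zip(blue_weights(variances), estimates))

def blue_variance(variances):
    """Minimum attainable variance 1 / sum(1 / sigma_i^2)."""
    return 1.0 / sum(1.0 / v for v in variances)

variances = [1.0, 4.0]   # hypothetical variances of T_1 and T_2
print(blue_weights(variances), blue([10.0, 12.0], variances), blue_variance(variances))
```

With equal variances the weights reduce to $1/k$ each, which recovers the simple arithmetic mean mentioned in the note.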
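Some remarks on unbiased estimators:
Remark 1. An unbiased estimator may be inadmissible.
Example: Suppose $X \sim P(\theta)$ and $g(\theta) = e^{-3\theta}$, $\theta > 0$. We want to find an unbiased estimator of $g(\theta)$.

Let $T(X)$ be an unbiased estimator of $g(\theta)$. Hence

$$E_\theta(T(X)) = g(\theta) \quad \text{for all } \theta > 0$$

$$\Rightarrow \sum_{x=0}^{\infty} T(x)\frac{e^{-\theta}\theta^x}{x!} = e^{-3\theta} \quad \text{for all } \theta > 0$$

$$\Rightarrow \sum_{x=0}^{\infty} T(x)\frac{\theta^x}{x!} = e^{-2\theta} = \sum_{x=0}^{\infty} \frac{(-2)^x\theta^x}{x!} \quad \text{for all } \theta > 0.$$

Comparing the coefficients of $\theta^x$ on both sides of the above identity we get $T(x) = (-2)^x$ for all $x$. Hence $T(X) = (-2)^X$ is the unique unbiased estimator of $g(\theta)$, but it is improper, since $g(\theta)$ is a monotone decreasing continuous function of $\theta$ taking only positive values, whereas $T$ is an oscillatory function of $X$ taking both positive and negative values.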
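The unbiasedness of this oscillating estimator can be checked numerically by summing the Poisson series. This is a sketch only: the truncation at 100 terms and the function names are assumptions, and the truncated sum is an approximation to the infinite series.

```python
import math

def expectation_under_poisson(t, theta, terms=100):
    """E_theta[t(X)] for X ~ Poisson(theta), truncating the series after `terms` terms."""
    total, pmf = 0.0, math.exp(-theta)   # pmf starts at P(X = 0)
    for x in range(terms):
        total += t(x) * pmf
        pmf *= theta / (x + 1)           # P(X = x + 1) obtained from P(X = x)
    return total

def estimator(x):
    """T(x) = (-2)^x, the unbiased estimator derived above."""
    return (-2.0) ** x

for theta in (0.5, 1.0, 2.0):
    print(theta, expectation_under_poisson(estimator, theta), math.exp(-3 * theta))
```

The two printed columns agree closely for each `theta`, even though every individual value of `estimator(x)` is either at least 1 or negative, and so can never be close to a number in $(0, 1)$.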


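Remark 2. An unbiased estimator may not exist.

Example: Let $X$ be a random variable with p.m.f.

$$f_\theta(x) = (1 - \theta)^x\theta, \quad x = 0, 1, 2, \ldots, \quad 0 < \theta < 1.$$

Suppose our object is to find an unbiased estimator of $\theta$. If possible, suppose $T(X)$ is an unbiased estimator of $\theta$. Then

$$E_\theta(T(X)) = \theta \quad \text{for all } 0 < \theta < 1$$

$$\Rightarrow \sum_{x=0}^{\infty} T(x)(1 - \theta)^x\theta = \theta \quad \text{for all } 0 < \theta < 1$$

$$\Rightarrow \sum_{x=0}^{\infty} T(x)(1 - \theta)^x = 1 \quad \text{for all } 0 < \theta < 1.$$
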
x=0 


Such an identity can not hold since the L.H.S is a function of 0 but the R.H.S is 
independent of 0. 

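Remark 3. An unbiased estimator may not be unique.

Example: Let $X_i \sim \text{Rect}(0, \theta)$, $i = 1, 2, \ldots, n$, independently. If $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $X_{(n)} = \max(X_1, X_2, \ldots, X_n)$, then $T_1 = 2\bar{X}$ and $T_2 = \frac{n+1}{n}X_{(n)}$ are both unbiased estimators of $\theta$.

Proof: Since $X_i \sim \text{Rect}(0, \theta)$, $i = 1, 2, \ldots, n$,

$E_\theta(X_i) = \frac{\theta}{2}$ for all $\theta > 0$ and for all $i = 1, 2, \ldots, n$.

It follows that $E_\theta(\bar{X}) = \frac{\theta}{2}$ for all $\theta > 0$, and hence $E_\theta(T_1) = \theta$ for all $\theta > 0$.

The p.d.f. of $X_{(n)}$ is $f_\theta(x_{(n)}) = \frac{n\,x_{(n)}^{n-1}}{\theta^n}$, $0 < x_{(n)} < \theta$.

Hence $E_\theta(X_{(n)}) = \frac{n}{n+1}\theta$ for all $\theta > 0$, and hence $E_\theta(T_2) = \theta$ for all $\theta > 0$.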
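That both estimators centre on $\theta$ can be seen in a simple simulation. The sketch below is illustrative, under assumed settings (`theta = 2.0`, `n = 5`, 20000 replications, fixed seed); the function name is hypothetical.

```python
import random

def simulate_means(theta, n, reps, seed=0):
    """Average of T1 = 2*Xbar and T2 = (n+1)/n * max(X) over `reps` simulated samples."""
    rng = random.Random(seed)
    sum_t1 = sum_t2 = 0.0
    for _ in range(reps):
        xs = [rng.uniform(0.0, theta) for _ in range(n)]
        sum_t1 += 2.0 * sum(xs) / n
        sum_t2 += (n + 1) / n * max(xs)
    return sum_t1 / reps, sum_t2 / reps

avg_t1, avg_t2 = simulate_means(theta=2.0, n=5, reps=20000)
print(avg_t1, avg_t2)   # both averages should be close to theta = 2.0
```

Although both averages settle near $\theta$, the two estimators generally give different estimates on any single sample, which is the point of the remark.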




Remark 4. Unbiasedness is not, in general, invariant under a transformation, i.e. if $T$ is an unbiased estimator of $\theta$ then $g(T)$ need not be unbiased for $g(\theta)$.

Example: If $g$ is convex downward, then by Jensen's inequality $E_\theta(g(T)) \ge g(E_\theta(T))$, and hence $E_\theta(g(T)) \ge g(\theta)$ for all $\theta$. For example, suppose $T$ is an unbiased estimator of $\theta$ and $g(\theta) = \theta^2$. Since $g$ is convex downward we have $E_\theta(g(T)) \ge \theta^2$ for all $\theta$, i.e. $g(T)$ has an upward bias. Note that if $g$ is convex upward, then $E_\theta(g(T)) \le g(\theta)$ for all $\theta$. For example, suppose $g(\theta) = \sqrt{\theta}$, $\theta > 0$, with $T \ge 0$. Since $g$ is convex upward, $E_\theta(g(T)) \le \sqrt{\theta}$ for all $\theta$, i.e. $g(T)$ has a downward bias.

A practical example:

1. Consider a coin for which the probability of getting a head in a toss is $\theta$, $0 < \theta < 1$, which is unknown. To estimate $\frac{1}{\theta}$ unbiasedly, the following inverse sampling methodology is adopted: the coin is tossed repeatedly until a head appears, and the experiment is performed independently 20 times. From the outcomes of the 20 trials it is found that the first head appears on the 1st, 3rd, 5th, 1st, 2nd, 1st, 3rd, 7th, 2nd, 4th, 4th, 8th, 1st, 3rd, 6th, 5th, 2nd, 1st, 6th and 2nd toss.

Let $X_i$ be the number of tosses required to get the first head in the ith trial, $i = 1, 2, \ldots, 20$. Then the p.m.f. of $X_i$ is given by

$$f_\theta(x) = (1 - \theta)^{x-1}\theta, \quad x = 1, 2, \ldots;\ 0 < \theta < 1.$$

Then

$$E_\theta(X_i) = \sum_{x=1}^{\infty} x(1 - \theta)^{x-1}\theta = \theta\big(1 - (1 - \theta)\big)^{-2} = \frac{1}{\theta} \quad \text{for all } i.$$

Hence $E_\theta\Big(\frac{1}{20}\sum_{i=1}^{20} X_i\Big) = \frac{1}{\theta}$ for all $\theta$. From the given data we have $\sum_{i=1}^{20} x_i = 67$, and hence an unbiased estimate of $\frac{1}{\theta}$ is given by $67/20 = 3.35$.
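The arithmetic of this example is easy to reproduce. The short sketch below recomputes the total and the estimate from the 20 recorded toss numbers; the variable and function names are assumptions introduced for the illustration.

```python
# Toss number on which the first head appeared in each of the 20 runs (data above).
first_head = [1, 3, 5, 1, 2, 1, 3, 7, 2, 4, 4, 8, 1, 3, 6, 5, 2, 1, 6, 2]

def estimate_inverse_theta(counts):
    """Sample mean of the geometric counts, which is unbiased for 1/theta."""
    return sum(counts) / len(counts)

print(len(first_head), sum(first_head), estimate_inverse_theta(first_head))
```

The printed total and mean match the values 67 and 3.35 quoted in the text.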

Note: From Remark 2 it is clear that it is not possible to find an unbiased estimate of $\theta$ from the above data set.




Uniformly Minimum Variance Unbiased Estimator (UMVUE) 

Let $g(\theta)$ be an estimable function of the unknown parameter $\theta$, and let $U_g = \{T(X) : E_\theta(T) = g(\theta),\ 0 < V_\theta(T) < \infty,\ \theta \in \Theta\}$ be the class of all unbiased estimators of $g(\theta)$ with finite variances. An estimator $T \in U_g$ is said to be a uniformly minimum variance unbiased estimator of $g(\theta)$ if $\operatorname{Var}_\theta(T) \le \operatorname{Var}_\theta(T')$ for all $\theta \in \Theta$ and all $T' \in U_g$.

Method of Covariance

Let $U_0 = \{h(X) : E_\theta(h) = 0,\ 0 < V_\theta(h) < \infty,\ \theta \in \Theta\}$ be the class of all unbiased estimators of 0 with finite variances. Then $T \in U_g$ is the UMVUE of $g$ if and only if $\operatorname{Cov}_\theta(T, h) = E_\theta(Th) = 0$ for all $\theta \in \Theta$ and for all $h \in U_0$, i.e. $T$ is uncorrelated with every unbiased estimator of 0.

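Proof: Only if part

Let $T \in U_g$ be the UMVUE of $g(\theta)$ and let $T^*$ be any other statistic such that

$T^* = T + \lambda h$, where $h \in U_0$ and $\lambda$ is a fixed constant.

Therefore $E_\theta(T^*) = g(\theta)$ for all $\theta \in \Theta$

$$\Rightarrow T^* \in U_g.$$

Since $T$ is the UMVUE,

$$\operatorname{Var}_\theta(T) \le \operatorname{Var}_\theta(T^*) = \operatorname{Var}_\theta(T) + \lambda^2\operatorname{Var}_\theta(h) + 2\lambda\operatorname{Cov}_\theta(T, h) \quad \text{for all } \theta$$

$$\Rightarrow \lambda^2\operatorname{Var}_\theta(h) + 2\lambda\operatorname{Cov}_\theta(T, h) \ge 0 \quad \text{for all } \theta. \qquad (1)$$

Now if $\operatorname{Cov}_\theta(T, h) \ne 0$ for some $\theta_0 \in \Theta$, let us choose $\lambda = -\frac{\operatorname{Cov}_{\theta_0}(T, h)}{\operatorname{Var}_{\theta_0}(h)}$. Then for $\theta = \theta_0$ the L.H.S. of (1) becomes

$$\lambda^2\operatorname{Var}_{\theta_0}(h) + 2\lambda\operatorname{Cov}_{\theta_0}(T, h) = -\frac{\operatorname{Cov}^2_{\theta_0}(T, h)}{\operatorname{Var}_{\theta_0}(h)} < 0.$$

This contradicts (1) for $\theta = \theta_0$.
Hence $\operatorname{Cov}_\theta(T, h) = 0$ for all $\theta \in \Theta$.

If part

Suppose $T \in U_g$ is an estimator of $g(\theta)$ such that $\operatorname{Cov}_\theta(T, h) = 0$ for all $\theta \in \Theta$ and for all $h \in U_0$. Let $T^*$ be another unbiased estimator of $g(\theta)$, and define

$$h^* = T^* - T.$$

Then $E_\theta(h^*) = E_\theta(T^*) - E_\theta(T) = 0$ for all $\theta \in \Theta$

$$\Rightarrow h^* \in U_0$$
$$\Rightarrow \operatorname{Cov}_\theta(T, h^*) = 0 \quad \text{for all } \theta \in \Theta$$
$$\Rightarrow \operatorname{Cov}_\theta(T, T^* - T) = 0 \quad \text{for all } \theta \in \Theta.$$

Now

$$\operatorname{Var}_\theta(T^*) = \operatorname{Var}_\theta(T^* - T + T) = \operatorname{Var}_\theta(T^* - T) + \operatorname{Var}_\theta(T) \ge \operatorname{Var}_\theta(T) \quad \text{for all } \theta \in \Theta.$$

Since $T^*$ is arbitrary, $T$ has the minimum variance among all unbiased estimators of $g(\theta)$, i.e. $T$ is the UMVUE of $g(\theta)$.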

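Example 1: Let $X_1, X_2, \ldots, X_n$ be i.i.d. $N(\theta, 1)$ random variables, where $-\infty < \theta < \infty$ is unknown. Suppose we want to find the UMVUE of $\theta$. The joint p.d.f. of $X_1, X_2, \ldots, X_n$ is given by

$$f_\theta(x) = \frac{1}{(2\pi)^{n/2}}\exp\Big(-\frac{1}{2}\sum_{i=1}^{n}(x_i - \theta)^2\Big), \quad x \in \mathbb{R}^n.$$

Let $h(X)$ be an unbiased estimator of 0, i.e.

$$E_\theta(h(X)) = 0 \quad \text{for all } \theta$$
$$\Rightarrow \int h(x)f_\theta(x)\,dx = 0 \quad \text{for all } \theta.$$

Differentiating w.r.t. $\theta$ we get

$$\int h(x)\sum_{i=1}^{n}(x_i - \theta)f_\theta(x)\,dx = 0 \quad \text{for all } \theta$$
$$\Rightarrow \int h(x)(\bar{x} - \theta)f_\theta(x)\,dx = 0 \quad \text{for all } \theta$$
$$\Rightarrow E_\theta(h(X)\bar{X}) - \theta E_\theta(h(X)) = 0 \quad \text{for all } \theta$$
$$\Rightarrow E_\theta(h(X)\bar{X}) = 0 \quad \text{for all } \theta, \text{ since } E_\theta(h(X)) = 0 \text{ for all } \theta.$$

Also $E_\theta(\bar{X}) = \theta$ for all $\theta$. Hence $\bar{X}$ is an unbiased estimator of $\theta$ which is uncorrelated with every unbiased estimator of 0. By the above result $\bar{X}$ is the UMVUE of $\theta$.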


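Example 2: Let $X$ be a random variable with p.m.f.

$$f_\theta(x) = \begin{cases} \theta & \text{if } x = -1, \\ (1 - \theta)^2\theta^x & \text{if } x = 0, 1, 2, \ldots \end{cases}$$

Suppose we want to find the parametric functions $g(\theta)$ which admit a UMVUE, and also to find such estimators.

Let $h(X)$ be an unbiased estimator of 0, i.e.

$$E_\theta(h(X)) = 0 \quad \text{for all } \theta$$
$$\Rightarrow \theta h(-1) + \sum_{x=0}^{\infty} h(x)(1 - \theta)^2\theta^x = 0 \quad \text{for all } \theta$$
$$\Rightarrow \sum_{x=0}^{\infty} h(x)\theta^x = -\frac{h(-1)\,\theta}{(1 - \theta)^2} = -\sum_{x=1}^{\infty} x\,h(-1)\,\theta^x \quad \text{for all } \theta.$$

Comparing the coefficients of $\theta^x$, $x = 0, 1, 2, \ldots$, on both sides of the above identity we get

$$h(x) = -x\,h(-1),\ x = 1, 2, \ldots, \quad \text{and} \quad h(0) = 0. \qquad (1)$$

Let $T$ be the UMVUE of $g(\theta)$. Then

$$E_\theta(Th) = 0 \quad \text{for all } \theta.$$

Hence by a similar argument we get

$$t(x)h(x) = -x\,t(-1)h(-1),\ x = 1, 2, \ldots, \quad \text{and} \quad t(0)h(0) = 0. \qquad (2)$$

Comparing (1) and (2) we get

$$t(x) = t(-1),\ x = 1, 2, \ldots, \quad \text{and } t(0) \text{ is arbitrary.}$$

$T(X)$ thus defined is the UMVUE of its expectation:

$$E_\theta(T) = g(\theta) \quad \text{for all } \theta$$
$$\Rightarrow g(\theta) = (1 - \theta)^2t(0) + t(-1)\big[1 - (1 - \theta)^2\big]$$
$$\Rightarrow g(\theta) = c_1(1 - \theta)^2 + c_0, \quad \text{where } c_0 = t(-1) \text{ and } c_1 = t(0) - t(-1).$$

Hence an estimable function $g(\theta)$ will have a UMVUE if and only if it is of the form $g(\theta) = c_1(1 - \theta)^2 + c_0$, and the UMVUE is given by

$$T(X) = \begin{cases} c_0 + c_1 & \text{if } X = 0, \\ c_0 & \text{if } X = -1, 1, 2, \ldots \end{cases}$$

Note: If we take $g(\theta) = \theta$, then it cannot be put in the above form, and hence there does not exist a UMVUE of $\theta$, though there exists an unbiased estimator of $\theta$, which is given by

$$T(X) = \begin{cases} 1 & \text{if } X = -1, \\ 0 & \text{if } X = 0, 1, \ldots \end{cases}$$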
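The conclusions of this example can be checked numerically by summing over the support of $X$. The sketch below is only an illustration: the constants `c0` and `c1`, the value `theta = 0.3`, the normalisation $h(-1) = 1$, the truncation of the infinite sum and all function names are assumptions made for the check.

```python
def pmf(x, theta):
    """f_theta(-1) = theta; f_theta(x) = (1 - theta)^2 * theta^x for x = 0, 1, 2, ..."""
    return theta if x == -1 else (1 - theta) ** 2 * theta ** x

def expectation(fn, theta, terms=2000):
    """E_theta[fn(X)], truncating the infinite support after `terms` values of x."""
    return sum(fn(x) * pmf(x, theta) for x in range(-1, terms))

c0, c1 = 2.0, 3.0   # arbitrary constants (assumptions)

def T(x):
    """Candidate UMVUE: c0 + c1 at x = 0, and c0 elsewhere."""
    return c0 + c1 if x == 0 else c0

def h(x):
    """Unbiased estimator of zero with h(-1) = 1, h(0) = 0 and h(x) = -x for x >= 1."""
    return 1.0 if x == -1 else -float(x)

theta = 0.3
print(expectation(h, theta))                        # approximately 0
print(expectation(lambda x: T(x) * h(x), theta))    # approximately 0
print(expectation(T, theta), c1 * (1 - theta) ** 2 + c0)
```

The first two printed values are numerically zero, reflecting $E_\theta(h) = 0$ and $E_\theta(Th) = 0$, and the last line compares $E_\theta(T)$ with $c_1(1 - \theta)^2 + c_0$.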




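Some Results on the Uniformly Minimum Variance Unbiased Estimator
Result 1: A uniformly minimum variance unbiased estimator, if it exists, is unique.
Proof: If possible, suppose $T_1$ and $T_2$ are two UMVUEs of $g(\theta)$. Then $E_\theta(T_1) = E_\theta(T_2) = g(\theta)$ and $\operatorname{Var}_\theta(T_1) = \operatorname{Var}_\theta(T_2)$ for all $\theta$. Define a statistic $T$ such that

$$T = \frac{1}{2}(T_1 + T_2).$$

Then $E_\theta(T) = g(\theta)$ for all $\theta$ and

$$\operatorname{Var}_\theta(T) = \frac{1}{4}\big[\operatorname{Var}_\theta(T_1) + \operatorname{Var}_\theta(T_2) + 2\operatorname{Cov}_\theta(T_1, T_2)\big]. \qquad (1)$$

If $\rho_\theta(T_1, T_2)$ is the correlation coefficient between $T_1$ and $T_2$, then

$$\operatorname{Cov}_\theta(T_1, T_2) = \rho_\theta(T_1, T_2)\sqrt{\operatorname{Var}_\theta(T_1)\operatorname{Var}_\theta(T_2)} = \rho_\theta(T_1, T_2)\operatorname{Var}_\theta(T_1).$$

From (1) we get

$$\operatorname{Var}_\theta(T) = \frac{1}{2}\operatorname{Var}_\theta(T_1)\big[1 + \rho_\theta(T_1, T_2)\big].$$

Since $T_1$ is a UMVUE of $g(\theta)$,

$$\operatorname{Var}_\theta(T_1) \le \operatorname{Var}_\theta(T) \quad \text{for all } \theta$$
$$\Rightarrow \rho_\theta(T_1, T_2) \ge 1 \quad \text{for all } \theta$$
$$\Rightarrow \rho_\theta(T_1, T_2) = 1 \quad \text{for all } \theta$$
$$\Rightarrow P_\theta[T_1 = a(\theta) + b(\theta)T_2] = 1 \quad \text{for all } \theta, \text{ for some constants } a(\theta),\ b(\theta) > 0$$
$$\Rightarrow E_\theta(T_1) = a(\theta) + b(\theta)E_\theta(T_2) \quad \text{for all } \theta$$
$$\Rightarrow g(\theta) = a(\theta) + b(\theta)g(\theta) \quad \text{for all } \theta.$$

Moreover, $\operatorname{Var}_\theta(T_1) = b^2(\theta)\operatorname{Var}_\theta(T_2)$ together with $\operatorname{Var}_\theta(T_1) = \operatorname{Var}_\theta(T_2)$ gives $b(\theta) = 1$, and the above identity then gives $a(\theta) = 0$ for all $\theta$. Hence $P_\theta[T_1 = T_2] = 1$ for all $\theta$.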

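Result 2: The correlation coefficient between the UMVUE and any other unbiased estimator of $g(\theta)$ is always non-negative.

Proof: Let $T$ be the UMVUE of $g(\theta)$ and $T'$ be any other unbiased estimator of $g(\theta)$. Then

$$E_\theta(T) = E_\theta(T') = g(\theta) \quad \text{for all } \theta.$$

Let us define a statistic $h = T - T'$. Then $E_\theta(h) = 0$ for all $\theta$, and hence $h$ is an unbiased estimator of 0, i.e. $h \in U_0$.

By the method of covariance,

$$\operatorname{Cov}_\theta(T, h) = 0 \quad \text{for all } \theta$$
$$\Rightarrow \operatorname{Cov}_\theta(T, T - T') = 0 \quad \text{for all } \theta$$
$$\Rightarrow \operatorname{Cov}_\theta(T, T') = \operatorname{Var}_\theta(T) \quad \text{for all } \theta.$$

The correlation coefficient between $T$ and $T'$ is given by

$$\rho_\theta = \frac{\operatorname{Cov}_\theta(T, T')}{\sqrt{\operatorname{Var}_\theta(T)\operatorname{Var}_\theta(T')}} = \sqrt{\frac{\operatorname{Var}_\theta(T)}{\operatorname{Var}_\theta(T')}} \quad \text{for all } \theta,$$

which is always non-negative.

Note: The efficiency of $T'$ w.r.t. $T$ is defined by

$$e_\theta = \frac{\operatorname{Var}_\theta(T)}{\operatorname{Var}_\theta(T')}, \quad \theta \in \Theta.$$

From the above result we get

$$\rho_\theta = \sqrt{e_\theta} \quad \text{for all } \theta.$$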
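The relation between the correlation and the efficiency can be observed in a simulation for i.i.d. $N(\theta, 1)$ data, where the sample mean is the UMVUE of $\theta$ and the sample median is another unbiased estimator (by symmetry of the normal distribution). The sketch below is an illustration under assumed settings (`theta = 1.0`, `n = 5`, 20000 replications, fixed seed); the function names are hypothetical.

```python
import math
import random
import statistics

def simulate_mean_and_median(theta, n, reps, seed=0):
    """Draw `reps` N(theta, 1) samples of size n; return the lists of means and medians."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        xs = [rng.gauss(theta, 1.0) for _ in range(n)]
        means.append(statistics.fmean(xs))
        medians.append(statistics.median(xs))
    return means, medians

def correlation(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))

means, medians = simulate_mean_and_median(theta=1.0, n=5, reps=20000)
efficiency = statistics.pvariance(means) / statistics.pvariance(medians)
print(correlation(means, medians), math.sqrt(efficiency))
```

Up to simulation error, the simulated correlation between the mean and the median agrees with the square root of the simulated efficiency of the median.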
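Result 3: If $T_1, T_2, \ldots, T_k$ are the UMVUEs of $g_i(\theta)$, $i = 1, 2, \ldots, k$, then $\sum_{i=1}^{k} a_iT_i$ is the UMVUE of $\sum_{i=1}^{k} a_ig_i(\theta)$.

Proof: Since $E_\theta(T_i) = g_i(\theta)$ for all $i$, we have

$$E_\theta\Big(\sum_{i=1}^{k} a_iT_i\Big) = \sum_{i=1}^{k} a_ig_i(\theta) \quad \text{for all } \theta.$$

Hence $\sum_{i=1}^{k} a_iT_i$ is an unbiased estimator of $\sum_{i=1}^{k} a_ig_i(\theta)$. Since $T_i$ is the UMVUE of $g_i(\theta)$ for all $i$, by the method of covariance

$$E_\theta(T_ih) = 0 \quad \text{for all } \theta \text{ and for all } h \in U_0$$
$$\Rightarrow E_\theta\Big(\sum_{i=1}^{k} a_iT_i \cdot h\Big) = 0 \quad \text{for all } \theta \text{ and for all } h \in U_0.$$

Hence $\sum_{i=1}^{k} a_iT_i$ is the UMVUE of $\sum_{i=1}^{k} a_ig_i(\theta)$.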
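Result 4: If $T_1, T_2, \ldots, T_k$ are the UMVUEs of $g_i(\theta)$, $i = 1, 2, \ldots, k$, then $\prod_{i=1}^{k} T_i$ is the UMVUE of its expectation.
Proof: Let $h \in U_0$. Since $T_1$ is the UMVUE of $g_1(\theta)$, by the method of covariance

$$E_\theta(T_1h) = 0 \quad \text{for all } \theta$$
$$\Rightarrow T_1h \in U_0.$$

Again, since $T_2$ is the UMVUE of $g_2(\theta)$, we get

$$E_\theta(T_2T_1h) = 0 \quad \text{for all } \theta$$
$$\Rightarrow T_2T_1h \in U_0.$$

Hence $E_\theta(T_3T_2T_1h) = 0$ for all $\theta$. Proceeding in this way we finally get

$$E_\theta\Big(\prod_{i=1}^{k} T_i \cdot h\Big) = 0 \quad \text{for all } \theta.$$

Hence $\prod_{i=1}^{k} T_i$ is the UMVUE of its expectation.
Note: If $T$ is the UMVUE of $g(\theta)$, then by the above result $T^k$ is the UMVUE of its expectation, where $k$ is any positive integer.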


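Result 5: If the sample consists of $n$ independent observations from the same distribution, then the UMVUEs are symmetric in the observations.

Proof: Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ drawn from a population with probability function $f_\theta(x)$, where $\theta$ is an unknown parameter. Suppose $T(X_1, X_2, \ldots, X_n)$ is an unbiased estimator of $g(\theta)$ with $\operatorname{Var}_\theta(T(X_1, X_2, \ldots, X_n)) = \sigma^2$. Let us consider a symmetric function of $(X_1, X_2, \ldots, X_n)$, viz.

$$T^* = \frac{1}{n!}\sum T(X_{i_1}, X_{i_2}, \ldots, X_{i_n}),$$

where the summation is taken over all possible permutations $(i_1, i_2, \ldots, i_n)$ of $(1, 2, \ldots, n)$. Then $E_\theta(T^*) = g(\theta)$ for all $\theta$, and $\operatorname{Var}_\theta(T^*)$ is equal to

$$\frac{1}{(n!)^2}\Big[\sum \operatorname{Var}_\theta\big(T(X_{i_1}, \ldots, X_{i_n})\big) + \sum\sum_{(i) \ne (j)} \operatorname{Cov}_\theta\big(T(X_{i_1}, \ldots, X_{i_n}),\ T(X_{j_1}, \ldots, X_{j_n})\big)\Big]. \qquad (1)$$

Now by symmetry $\operatorname{Var}_\theta(T(X_{i_1}, X_{i_2}, \ldots, X_{i_n})) = \sigma^2$ for all $\theta$, for any permutation $(i_1, i_2, \ldots, i_n)$. Also, by the Cauchy-Schwarz inequality,

$$\operatorname{Cov}_\theta\big(T(X_{i_1}, \ldots, X_{i_n}),\ T(X_{j_1}, \ldots, X_{j_n})\big) \le \sigma^2 \quad \text{for all } \theta$$

for any two different permutations $(i_1, i_2, \ldots, i_n)$ and $(j_1, j_2, \ldots, j_n)$. Hence from (1) we get

$$\operatorname{Var}_\theta(T^*) \le \frac{\sigma^2\,n! + \sigma^2\,n!(n! - 1)}{(n!)^2} = \sigma^2 \quad \text{for all } \theta.$$

Thus symmetrizing an unbiased estimator never increases its variance. In particular, if $T$ is the UMVUE of $g(\theta)$, then $T^*$ is also a UMVUE of $g(\theta)$, and by the uniqueness of the UMVUE (Result 1) $T = T^*$ with probability 1, i.e. the UMVUE is symmetric in the observations.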
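Result 6: Let $T_1$ and $T_2$ be two unbiased estimators of $g(\theta)$ with efficiencies $e_1$ and $e_2$ respectively. Then
\[ |\rho - \sqrt{e_1 e_2}| \le \sqrt{(1 - e_1)(1 - e_2)}, \]
where $\rho$ is the correlation coefficient between $T_1$ and $T_2$.

Proof: Let $T$ be the UMVUE of $g(\theta)$ with $\mathrm{Var}_\theta(T) = \sigma^2$. Then $\mathrm{Var}_\theta(T_i) = \sigma^2 / e_i$ for $i = 1, 2$. Let us define a statistic $T^*$ such that
\[ T^* = \alpha T_1 + (1 - \alpha) T_2, \text{ for any real constant } \alpha. \]
Then
\[ E_\theta(T^*) = \alpha g(\theta) + (1 - \alpha) g(\theta) = g(\theta) \text{ for all } \theta. \]
Hence $T^*$ is also an unbiased estimator of $g(\theta)$ with
\[ \mathrm{Var}_\theta(T^*) = \alpha^2 \mathrm{Var}_\theta(T_1) + (1 - \alpha)^2 \mathrm{Var}_\theta(T_2) + 2\alpha(1 - \alpha)\mathrm{Cov}_\theta(T_1, T_2) \]
\[ = \alpha^2 \mathrm{Var}_\theta(T_1) + (1 - \alpha)^2 \mathrm{Var}_\theta(T_2) + 2\alpha(1 - \alpha)\rho\sqrt{\mathrm{Var}_\theta(T_1)\,\mathrm{Var}_\theta(T_2)} \]
\[ = \sigma^2 \Big[ \frac{\alpha^2}{e_1} + \frac{(1 - \alpha)^2}{e_2} + \frac{2\alpha(1 - \alpha)\rho}{\sqrt{e_1 e_2}} \Big]. \]
Since $T$ is the UMVUE of $g(\theta)$,
\[ \mathrm{Var}_\theta(T) \le \mathrm{Var}_\theta(T^*) \text{ for all } \theta. \]
After some routine algebra we get
\[ a\alpha^2 + b\alpha + c \ge 0, \quad (1) \]
where
\[ a = \frac{1}{e_1} + \frac{1}{e_2} - \frac{2\rho}{\sqrt{e_1 e_2}}, \quad b = 2\Big(\frac{\rho}{\sqrt{e_1 e_2}} - \frac{1}{e_2}\Big), \quad c = \frac{1}{e_2} - 1. \]
Since (1) holds for every real $\alpha$, the discriminant of the quadratic equation $a\alpha^2 + b\alpha + c = 0$ cannot be positive, so we have
\[ \Big(\frac{\rho}{\sqrt{e_1 e_2}} - \frac{1}{e_2}\Big)^2 \le \Big(\frac{1}{e_1} + \frac{1}{e_2} - \frac{2\rho}{\sqrt{e_1 e_2}}\Big)\Big(\frac{1}{e_2} - 1\Big). \]
Expanding both sides and cancelling the common terms gives
\[ \Big(\frac{\rho}{\sqrt{e_1 e_2}} - 1\Big)^2 \le \Big(\frac{1}{e_1} - 1\Big)\Big(\frac{1}{e_2} - 1\Big), \]
and multiplying through by $e_1 e_2$ yields $(\rho - \sqrt{e_1 e_2})^2 \le (1 - e_1)(1 - e_2)$, which is the required inequality.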


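Note : Let $T_1$ and $T_2$ be two unbiased estimators of $g(\theta)$, each with efficiency $e$. If $\rho$ is the correlation coefficient between $T_1$ and $T_2$, then taking $e_1 = e_2 = e$ in the above result we have
\[ 2e - 1 \le \rho \le 1. \]
If instead $T_1$ is the UMVUE itself ($e_1 = 1$) and $T_2$ has efficiency $e$, the same result gives $\rho = \sqrt{e}$.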




SUFFICIENCY 

Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from a population with p.m.f./p.d.f. $f_\theta(x)$, where $\theta \in \Theta \subseteq \mathbb{R}$ is an unknown parameter. A sufficient statistic for $\theta$ is a statistic that, in a certain sense, captures all the information about $\theta$ contained in the sample $X = (X_1, X_2, \ldots, X_n)$. In other words, if a statistic $T(X)$ is sufficient for $\theta$, then after observing $T$ it is not possible to get any additional information about $\theta$ from the sample.

Definition A statistic $T$ is said to be a sufficient statistic for $\theta$ (or, rather, for the family of distributions $F_\theta = \{ f_\theta(x) : \theta \in \Theta \}$) if the conditional distribution of any other statistic, say $T^*$, given $T = t$ is independent of $\theta$ for every possible value of $t$.
Note : If we take $T^* = X = (X_1, X_2, \ldots, X_n)$, then a statistic $T$ is said to be a sufficient statistic for $\theta$ if the conditional distribution of $(X_1, X_2, \ldots, X_n)$ given $T = t$ is independent of $\theta$ for every possible value of $t$.


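Remark 1 In this definition the parameter $\theta$ and the statistic $T$ may be vector valued.

Remark 2 If $T$ is sufficient for the family $F_\theta = \{ f_\theta(x) : \theta \in \Theta \}$ then $T$ is sufficient for the family $F^*_\theta = \{ f_\theta(x) : \theta \in \Theta^* \}$ where $\Theta^* \subset \Theta$.

Remark 3 A sufficient statistic contains all information regarding $\theta$ in the sense that from the knowledge of $T$ alone it is possible to generate (by a random mechanism) a random quantity $Y$ which is completely equivalent to the original $X$, i.e. the distribution of $Y$ is the same as that of $X$. Since $X$ and $Y$ have the same distribution for all $\theta$, they provide exactly the same information about $\theta$.

To justify this statement, suppose $X$ is a discrete random variable. Then for any $x$
\[ P_\theta(X = x) = P_\theta(X = x, T(X) = T(x)), \text{ since } \{X = x\} \subset \{T(X) = T(x)\}, \]
\[ = P_\theta(T(X) = T(x))\, P(X = x \mid T(X) = T(x)). \]
The second factor does not depend on $\theta$ because $T$ is sufficient. Hence, once $T = t$ has been observed, a value $Y$ drawn from this known conditional distribution has the same distribution as $X$ for every $\theta$.

Consider the following examples.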

Example 1: Let $X$ be a single observation from the $N(0, \sigma^2)$ distribution, where $\sigma$ is unknown. Then, given $|X| = t$, the only two possible values of $X$ are $+t$ and $-t$, and by symmetry the conditional probability of each is $\frac{1}{2}$. The conditional distribution of $X$ given $|X| = t$ is therefore independent of $\sigma$, and $T = |X|$ is sufficient for $\sigma$.

Note : Given $T = t$ we can generate a random variable $Y$ which is equivalent to $X$ by using the following random number generation technique:

We toss a fair coin and define a random variable $Y$ such that
\[ Y = +t \text{ if a head appears,} \]
\[ Y = -t \text{ if a tail appears;} \]
then the random variable $Y$ is also a $N(0, \sigma^2)$ variable.
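The coin-tossing mechanism of this note can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the seed and the sample size are hypothetical choices, not part of the notes. It draws $X$ from $N(0, \sigma^2)$, keeps only $t = |X|$, and regenerates $Y = \pm t$ with a fair coin.

```python
import random

def regenerate_from_abs(t, rng):
    """Given the observed value t of T = |X|, return +t or -t with probability 1/2 each."""
    return t if rng.random() < 0.5 else -t

def simulate(sigma, n, seed=0):
    """Draw n values of X ~ N(0, sigma^2) and regenerate Y from |X| alone."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, sigma) for _ in range(n)]
    ys = [regenerate_from_abs(abs(x), rng) for x in xs]
    return xs, ys

def second_moment(values):
    return sum(v * v for v in values) / len(values)

xs, ys = simulate(sigma=2.0, n=20000)
# Fraction of regenerated values that are positive; should be close to 1/2.
frac_positive = sum(y > 0 for y in ys) / len(ys)
```

Note that the regeneration step never uses $\sigma$: only the observed $t$ and a fair coin are needed, which is exactly what sufficiency of $|X|$ asserts.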


Example 2 : Let $X_1, X_2$ be i.i.d. $P(\lambda)$ random variables. Consider the statistic $T = X_1 + X_2$. Consider the conditional probability
\[ P(X_1 = x_1, X_2 = x_2 \mid T = t) = \frac{P(X_1 = x_1, X_2 = t - x_1)}{P(X_1 + X_2 = t)}, \text{ if } t = x_1 + x_2, \]
\[ = 0, \text{ otherwise.} \]
Thus, for $x_1 = 0, 1, 2, \ldots, t - 1, t$ and $x_1 + x_2 = t$, we have
\[ P(X_1 = x_1, X_2 = x_2 \mid X_1 + X_2 = t) = \binom{t}{x_1}\Big(\frac{1}{2}\Big)^t, \]
which is independent of $\lambda$. Hence $X_1 + X_2$ is sufficient for $\lambda$.

Note : Given $T = t$ we can generate a random vector $(Y_1, Y_2)$ which is equivalent to $(X_1, X_2)$ by using the following random number generation technique:

Toss a fair coin $t$ times and let $Y_1$ and $Y_2 = t - Y_1$ be respectively the number of heads and the number of tails obtained in the $t$ tosses. Then the joint distribution of $(Y_1, Y_2)$ is the same as that of $(X_1, X_2)$.


Example 3: Let $X_1, X_2$ be independently and identically distributed as Poisson variables with parameter $\theta$. Consider the statistic $T = X_1 + 2X_2$. The possible values of $T$ are $0, 1, 2, \ldots$ We consider the conditional distribution of $(X_1, X_2)$ given $T = t$, where $t = 0, 1, 2, \ldots$
\[ P[X_1 = 0, X_2 = 0 \mid T = 0] = \frac{P[X_1 = 0, X_2 = 0]}{P[X_1 = 0, X_2 = 0]} = 1, \]
\[ P[X_1 = 1, X_2 = 0 \mid T = 1] = \frac{P[X_1 = 1, X_2 = 0]}{P[X_1 + 2X_2 = 1]} = \frac{P[X_1 = 1, X_2 = 0]}{P[X_1 = 1, X_2 = 0]} = 1, \]
\[ P[X_1 = 0, X_2 = 1 \mid T = 2] = \frac{P[X_1 = 0, X_2 = 1]}{P[X_1 + 2X_2 = 2]} = \frac{P[X_1 = 0, X_2 = 1]}{P[X_1 = 0, X_2 = 1] + P[X_1 = 2, X_2 = 0]} \]
\[ = \frac{\theta \exp(-2\theta)}{\theta \exp(-2\theta) + \frac{\theta^2}{2}\exp(-2\theta)} = \frac{2}{2 + \theta}, \]
and $P[X_1 = 2, X_2 = 0 \mid T = 2] = \frac{\theta}{2 + \theta}$.

Since the conditional distribution of $(X_1, X_2)$ given $T = 2$ depends on $\theta$, $T$ is not sufficient for $\theta$.
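The contrast between $T = X_1 + X_2$ (Example 2) and $T = X_1 + 2X_2$ (Example 3) can be checked numerically. The following is a minimal sketch, assuming positive integer weights; the helper names `poisson_pmf` and `conditional_pmf` are hypothetical and chosen only for this illustration.

```python
from math import comb, exp, factorial

def poisson_pmf(x, theta):
    """P(X = x) for X ~ Poisson(theta)."""
    return exp(-theta) * theta**x / factorial(x)

def conditional_pmf(a1, a2, t, theta):
    """P(X1 = x1, X2 = x2 | a1*X1 + a2*X2 = t) for i.i.d. Poisson(theta) X1, X2.

    a1 and a2 are assumed to be positive integers, so x1 and x2 cannot exceed t.
    """
    joint = {}
    for x1 in range(t + 1):
        for x2 in range(t + 1):
            if a1 * x1 + a2 * x2 == t:
                joint[(x1, x2)] = poisson_pmf(x1, theta) * poisson_pmf(x2, theta)
    total = sum(joint.values())
    return {pair: p / total for pair, p in joint.items()}

for theta in (0.5, 3.0):
    # T = X1 + X2: the conditional law is Binomial(t, 1/2), whatever theta is.
    print(theta, conditional_pmf(1, 1, 3, theta))
    # T = X1 + 2*X2: the conditional law given T = 2 changes with theta.
    print(theta, conditional_pmf(1, 2, 2, theta))
```

For the unweighted sum the printed probabilities should coincide across the two values of $\theta$, while for $X_1 + 2X_2$ the probability of $(0, 1)$ given $T = 2$ should equal $2/(2 + \theta)$.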


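Example 4: Let $X_1, X_2$ be independently and identically distributed as Bernoulli variables with parameter $\theta$. Consider the statistic $T = X_1 + 2X_2$.
The possible values of $T$ are 0, 1, 2 and 3, and
\[ \{X_1 + 2X_2 = t\} \Leftrightarrow \{X_1 = 0, X_2 = 0\}, \text{ if } t = 0 \]
\[ \Leftrightarrow \{X_1 = 1, X_2 = 0\}, \text{ if } t = 1 \]
\[ \Leftrightarrow \{X_1 = 0, X_2 = 1\}, \text{ if } t = 2 \]
\[ \Leftrightarrow \{X_1 = 1, X_2 = 1\}, \text{ if } t = 3. \]
Hence we get
\[ P[X_1 = 0, X_2 = 0 \mid T = 0] = 1, \]
\[ P[X_1 = 1, X_2 = 0 \mid T = 1] = 1, \]
\[ P[X_1 = 0, X_2 = 1 \mid T = 2] = 1, \]
\[ P[X_1 = 1, X_2 = 1 \mid T = 3] = 1. \]
Since the conditional distribution of $(X_1, X_2)$ given $T = t$ is independent of $\theta$ for every possible value of $t$, $T$ is sufficient for $\theta$.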


Example 5: Let $X_1, X_2$ be independently and identically distributed with p.m.f.
\[ f_N(x) = \frac{1}{N}, \text{ if } x = 1, 2, \ldots, N \]
\[ = 0, \text{ otherwise,} \]
where $N$ is unknown and $N \in \{1, 2, \ldots\}$. Consider the statistic $T = \max(X_1, X_2)$. The possible values of $T$ are $1, 2, \ldots, N$. For any such mass point $t$ we have
\[ P(T = t) = P(T \le t) - P(T \le t - 1) = \frac{t^2 - (t - 1)^2}{N^2}. \]
The conditional distribution of $(X_1, X_2)$ given $T = t$ is
\[ P[X_1 = x_1, X_2 = x_2 \mid T = t] = \frac{1}{t^2 - (t - 1)^2}, \text{ if } \max(x_1, x_2) = t \]
\[ = 0, \text{ otherwise.} \]
Since the conditional distribution of $(X_1, X_2)$ given $T = t$ is independent of $N$ for every possible value of $t$, $T$ is sufficient for $N$.
But if we consider the statistic $T_1 = \min(X_1, X_2)$, the possible values of $T_1$ are $1, 2, \ldots, N$. For any such mass point $t_1$ we have
\[ P(T_1 = t_1) = P(T_1 \ge t_1) - P(T_1 \ge t_1 + 1) = \frac{(N - t_1 + 1)^2 - (N - t_1)^2}{N^2}. \]
The conditional distribution of $(X_1, X_2)$ given $T_1 = t_1$ is
\[ P[X_1 = x_1, X_2 = x_2 \mid T_1 = t_1] = \frac{1}{(N - t_1 + 1)^2 - (N - t_1)^2}, \text{ if } \min(x_1, x_2) = t_1 \]
\[ = 0, \text{ otherwise.} \]
Since the conditional distribution of $(X_1, X_2)$ given $T_1 = t_1$ depends on $N$, $T_1$ is not sufficient for $N$.
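The two conditional distributions in Example 5 can be enumerated exactly for small $N$. The sketch below is illustrative only; `conditional_given` is a hypothetical helper, and it relies on the fact that every pair $(x_1, x_2)$ has the same probability $1/N^2$, so the conditional law is uniform on the pairs consistent with the observed statistic.

```python
from fractions import Fraction
from itertools import product

def conditional_given(stat, N, t):
    """Conditional p.m.f. of (X1, X2) given stat(X1, X2) = t.

    X1, X2 are i.i.d. uniform on {1, ..., N}; all N^2 pairs are equally likely,
    so the conditional law is uniform on the pairs with stat(x1, x2) == t.
    """
    pairs = [p for p in product(range(1, N + 1), repeat=2) if stat(*p) == t]
    return {p: Fraction(1, len(pairs)) for p in pairs}

# Given max = 2 the conditional law is the same for N = 3 and N = 5 ...
print(conditional_given(max, 3, 2) == conditional_given(max, 5, 2))
# ... but given min = 2 it differs, because larger values become possible.
print(conditional_given(min, 3, 2) == conditional_given(min, 5, 2))
```

With $t = 2$ the maximum leaves $t^2 - (t-1)^2 = 3$ pairs regardless of $N$, whereas the minimum leaves $(N - t_1 + 1)^2 - (N - t_1)^2$ pairs, a count that changes with $N$.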


Example 6 : Let $X_1, X_2$ be independently and identically distributed $N(\theta, 1)$ random variables. Consider two statistics $T = X_1 + X_2$ and $T_1 = X_1 + 2X_2$. The joint distribution of $(T, T_1)$ is a bivariate normal distribution with parameters $(2\theta, 3\theta, 2, 5, \frac{3}{\sqrt{10}})$. The conditional distribution of $T$ given $T_1 = t_1$ is normal with mean $\frac{1}{5}(3t_1 + \theta)$ and variance $\frac{1}{5}$, whereas the conditional distribution of $T_1$ given $T = t$ is normal with mean $\frac{3}{2}t$ and variance $\frac{1}{2}$. Since the conditional distribution of $T$ given $T_1 = t_1$ depends on $\theta$, $T_1$ is not sufficient for $\theta$. As the conditional distribution of $T_1$ given $T = t$ is independent of $\theta$, $T$ is sufficient for $\theta$.
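The conditional means and variances quoted in Example 6 follow from the standard bivariate normal regression formulas, with $\mathrm{Cov}(T, T_1) = 3$. A small sketch, with hypothetical function names, that evaluates them for different values of $\theta$:

```python
def conditional_normal(mean_a, mean_b, var_a, var_b, cov_ab, b):
    """Mean and variance of A given B = b when (A, B) is bivariate normal."""
    slope = cov_ab / var_b
    return mean_a + slope * (b - mean_b), var_a - slope * cov_ab

def t_given_t1(theta, t1):
    # T = X1 + X2 and T1 = X1 + 2*X2 with X1, X2 i.i.d. N(theta, 1):
    # E(T) = 2*theta, E(T1) = 3*theta, Var(T) = 2, Var(T1) = 5, Cov(T, T1) = 3.
    return conditional_normal(2 * theta, 3 * theta, 2.0, 5.0, 3.0, t1)

def t1_given_t(theta, t):
    return conditional_normal(3 * theta, 2 * theta, 5.0, 2.0, 3.0, t)

for theta in (0.0, 5.0):
    print(theta, t_given_t1(theta, 1.0), t1_given_t(theta, 1.0))
```

The first pair printed should move with $\theta$ (mean $(3t_1 + \theta)/5$, variance $1/5$), while the second should stay at mean $3t/2$ and variance $1/2$ for every $\theta$.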


Minimal sufficiency 
Definition A statistic $T$ is said to be a minimal sufficient statistic for $\theta$ if
(i) $T$ is sufficient for $\theta$, and
(ii) $T$ is a function of every other sufficient statistic.
A minimal sufficient statistic provides the greatest possible reduction of the data. Consider the following example:

Example : Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from the $N(\theta, 1)$ population. The following statistics are sufficient for the parameter $\theta$:
\[ T_1 = (X_1, X_2, \ldots, X_n) \]
\[ T_2 = (X_1 + X_2, X_3, \ldots, X_n) \]
\[ T_3 = (X_1 + X_2 + X_3, X_4, \ldots, X_n) \]
\[ \vdots \]
\[ T_n = (X_1 + X_2 + \ldots + X_n) \]
Note that $T_i$ is a function of $T_{i-1}$ for every $i = 2, 3, \ldots, n$. Among these, $T_n$ provides the most thorough reduction of the data, and it is in fact minimal sufficient for $\theta$.




We have seen that for any statistic $T$, if the conditional distribution of $X = (X_1, X_2, \ldots, X_n)$ given $T = t$ is independent of $\theta$ for every possible value of $t$, then the statistic $T$ is a sufficient statistic for $\theta$. But in any problem there are infinitely many statistics, and therefore it is a difficult task to find out which of these statistics is sufficient for the parameter $\theta$. In the present module we shall first discuss a very simple method for determining a sufficient statistic. This method is due to Fisher and Neyman and it is popularly known as the Fisher-Neyman factorization theorem.


Factorization theorem Let $(X_1, X_2, \ldots, X_n)$ be an n-dimensional random vector with joint p.m.f./p.d.f. $f_\theta(x_1, x_2, \ldots, x_n)$, where $\theta \in \Theta \subseteq \mathbb{R}$ is unknown. A statistic $T(X_1, X_2, \ldots, X_n)$ is sufficient for $\theta$ if and only if $f_\theta(x_1, x_2, \ldots, x_n)$ can be factorized as
\[ f_\theta(x_1, x_2, \ldots, x_n) = g_\theta(t(x_1, x_2, \ldots, x_n))\, h(x_1, x_2, \ldots, x_n), \quad (1) \]
where $g_\theta(t)$ depends on $\theta$ and on $(x_1, x_2, \ldots, x_n)$ only through $t(x_1, x_2, \ldots, x_n)$, and $h(x_1, x_2, \ldots, x_n)$ is independent of $\theta$.

Note: The factorization theorem also holds when $\theta$ is a vector of parameters and $T$ is a multidimensional random variable. In that case, we say that $T$ is jointly sufficient for $\theta$.


Proof: For the sake of simplicity we assume that $(X_1, X_2, \ldots, X_n)$ is an n-dimensional discrete random vector with joint p.m.f. $f_\theta(x_1, x_2, \ldots, x_n)$. Let us write $X = (X_1, X_2, \ldots, X_n)$ and $x = (x_1, x_2, \ldots, x_n)$.

Only if part Suppose $T$ is sufficient for $\theta$. Since $\{X = x\} \subset \{T(X) = t\}$ if $T(x) = t$, we can write
\[ f_\theta(x) = P_\theta(X = x) = P_\theta(X = x, T(X) = t), \text{ if } T(x) = t \]
\[ = P_\theta(T = t)\, P(X = x \mid T = t), \]
provided that $P(X = x \mid T = t)$ is well defined. Since $T$ is sufficient for $\theta$, the conditional probability $P(X = x \mid T = t)$ is independent of $\theta$. Let us write
\[ h(x) = P(X = x \mid T = t). \]
Setting
\[ g_\theta(t) = P_\theta(T = t), \]
we get $f_\theta(x) = g_\theta(t) h(x)$.


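If part Suppose that (1) holds. Then for fixed $t$ we have
\[ P_\theta(T = t) = \sum_{x : T(x) = t} P_\theta(X = x) = \sum_{x : T(x) = t} g_\theta(t(x)) h(x) = g_\theta(t) \sum_{x : T(x) = t} h(x). \]
Suppose that $P_\theta(T = t) > 0$. Then
\[ P_\theta(X = x \mid T = t) = \frac{P_\theta(X = x, T = t)}{P_\theta(T = t)}, \]
which is 0 when $T(x) \ne t$ and $\frac{P_\theta(X = x)}{P_\theta(T = t)}$ when $T(x) = t$. Thus if $T(x) = t$,
\[ P_\theta(X = x \mid T = t) = \frac{g_\theta(t) h(x)}{g_\theta(t) \sum_{y : T(y) = t} h(y)} = \frac{h(x)}{\sum_{y : T(y) = t} h(y)}, \]
which is independent of $\theta$. Therefore $T$ is sufficient for $\theta$.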


Remark 1 If $T$ is sufficient for $\theta$ and $T^* = \phi(T)$, where $\phi(\cdot)$ is a one-to-one function, then $T^*$ is also sufficient for $\theta$.
Proof : Since $\phi(\cdot)$ is one-to-one, we can write $T = \phi^{-1}(T^*) = \nu(T^*)$, say. By the factorization theorem (only if part) we can write
\[ f_\theta(x) = g_\theta(t) h(x) \]
\[ = g_\theta(\nu(t^*)) h(x) \]
\[ = g^*_\theta(t^*) h(x), \text{ say, where } g^*_\theta(\cdot) = g_\theta(\nu(\cdot)). \]
Hence by the factorization theorem (if part), $T^*$ is a sufficient statistic for $\theta$.

Note : Every function of a sufficient statistic may not be sufficient. For example, if $X \sim N(\theta, 1)$, $\theta \in \mathbb{R}$, then $X$ is sufficient for $\theta$ but $X^2$ is not. However, $X$ is sufficient for $\theta^2$.


Remark 2 Due to the factorization theorem, a necessary and sufficient condition for a statistic $T$ to be sufficient is that for any fixed $\theta_1$ and $\theta_2$, the ratio $f_{\theta_1}(x) / f_{\theta_2}(x)$ is a function of $T(x)$ only.


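Remark 3 If $T_1$ and $T_2$ are two distinct minimal sufficient statistics for $\theta$, then $T_1$ is a function of $T_2$.
Proof: By the factorization theorem,
\[ f_\theta(x) = g_{1\theta}(t_1) h_1(x) = g_{2\theta}(t_2) h_2(x), \]
so the dependence of $f_\theta(x)$ on $\theta$ can be expressed through either statistic. Since $T_1$ is minimal sufficient and $T_2$ is sufficient, this implies that $T_1$ is a function of $T_2$.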


Remark 4 If $(T_1, T_2)$ is jointly sufficient for $(\theta_1, \theta_2)$, then it does not follow that $T_i$ is sufficient for $\theta_i$, $i = 1, 2$.


Remark 5: Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ drawn from a continuous distribution with p.d.f. $f_\theta(x)$. Then the set of order statistics $T = (X_{(1)}, X_{(2)}, \ldots, X_{(n)})$ is jointly sufficient for $\theta$.
Proof: The joint p.d.f. of $T$ is given by
\[ g_\theta(t) = n! \prod_{i=1}^{n} f_\theta(x_{(i)}). \]
Since $(x_{(1)}, x_{(2)}, \ldots, x_{(n)})$ is nothing but a permutation of $(x_1, x_2, \ldots, x_n)$, we have
\[ g_\theta(t) = n! \prod_{i=1}^{n} f_\theta(x_i) \]
\[ \Rightarrow f_\theta(x) = g_\theta(t) \frac{1}{n!}. \]
Setting $h(x) = \frac{1}{n!}$, the remark follows from the factorization theorem.


Some examples: 


Example 1: Consider a set of $n$ Bernoullian trials with probability of success $p$. With the $i$-th trial we may associate a variable $X_i$ having the probability-mass function
\[ f(x_i \mid p) = p^{x_i}(1 - p)^{1 - x_i}, \text{ for } x_i = 0, 1. \]
The joint probability-mass function of $x_1, x_2, \ldots, x_n$ is
\[ f(x_1, x_2, \ldots, x_n \mid p) = p^{\sum_{i=1}^{n} x_i}(1 - p)^{n - \sum_{i=1}^{n} x_i} \]
\[ = g(T(x_1, x_2, \ldots, x_n) \mid p)\, h(x_1, x_2, \ldots, x_n), \]
where $T(x_1, x_2, \ldots, x_n) = \sum_{i=1}^{n} x_i$, $g(T(x_1, x_2, \ldots, x_n) \mid p) = p^{T(x_1, \ldots, x_n)}(1 - p)^{n - T(x_1, \ldots, x_n)}$ and $h(x_1, x_2, \ldots, x_n) = 1$. Hence $T(x_1, x_2, \ldots, x_n)$, the total number of successes out of $n$ trials, is a sufficient statistic for $p$. So is $T(x_1, x_2, \ldots, x_n)/n$, the average number of successes per trial.
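The factorization in this Bernoulli example can be checked numerically: two samples with the same number of successes must have the same joint probability for every $p$, and that probability must equal $g(T \mid p)$ with $h = 1$. A minimal sketch, with hypothetical helper names `joint_pmf` and `g`:

```python
def joint_pmf(xs, p):
    """Joint p.m.f. of independent Bernoulli(p) observations xs (each 0 or 1)."""
    prob = 1.0
    for x in xs:
        prob *= p if x == 1 else 1.0 - p
    return prob

def g(t, n, p):
    """Factor that depends on p and on the data only through t = sum(xs)."""
    return p**t * (1.0 - p) ** (n - t)

sample_a = (1, 0, 0, 1)
sample_b = (0, 1, 1, 0)   # a different arrangement with the same total T = 2
for p in (0.2, 0.7):
    print(p, joint_pmf(sample_a, p), joint_pmf(sample_b, p), g(2, 4, p))
```

The three printed probabilities in each row should agree up to rounding, which reflects that the arrangement of successes carries no information about $p$ beyond the total.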


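Example 2: Let $x_1, x_2, \ldots, x_n$ be $n$ independent random observations from a normal distribution whose mean $\mu$ is unknown and whose variance $\sigma^2 < \infty$ is known. Then the joint density function of $x_1, x_2, \ldots, x_n$ is
\[ f(x_1, x_2, \ldots, x_n \mid \mu) = \Big(\frac{1}{\sigma\sqrt{2\pi}}\Big)^n \exp\Big(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\Big) \]
\[ = \Big(\frac{1}{\sigma\sqrt{2\pi}}\Big)^n \exp\Big(-\frac{n}{2\sigma^2}(\bar{x} - \mu)^2\Big) \exp\Big(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \bar{x})^2\Big) \]
\[ = g(T(x_1, x_2, \ldots, x_n) \mid \mu)\, h(x_1, x_2, \ldots, x_n), \]
where $T(x_1, x_2, \ldots, x_n) = \sum_{i=1}^{n} x_i$. Hence $\sum_{i=1}^{n} x_i$, as well as $\bar{x}$, is a sufficient statistic for $\mu$.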
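Example 3: Let $x_1, x_2, \ldots, x_n$ be an independent random sample from a Poisson distribution with parameter $\lambda > 0$ unknown. The corresponding probability mass function is
\[ f(x_i \mid \lambda) = \frac{e^{-\lambda}\lambda^{x_i}}{x_i!}, \text{ for } x_i = 0, 1, 2, \ldots \]
The joint p.m.f. of $x_1, x_2, \ldots, x_n$ is
\[ f(x_1, x_2, \ldots, x_n \mid \lambda) = \frac{\exp(-n\lambda)\lambda^{T(x_1, x_2, \ldots, x_n)}}{\prod_{i=1}^{n} x_i!} \]
\[ = g(T(x_1, x_2, \ldots, x_n) \mid \lambda)\, h(x_1, x_2, \ldots, x_n), \]
where $T(x_1, x_2, \ldots, x_n) = \sum_{i=1}^{n} x_i$. Hence $\sum_{i=1}^{n} x_i$, as well as $\bar{x}$, is a sufficient statistic for $\lambda$.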
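Example 4: Let $x_1, x_2, \ldots, x_n$ be independently and identically distributed as Uniform $[0, \theta]$. The corresponding density for each $x_i$ is
\[ f(x \mid \theta) = \theta^{-1} I\{0 \le x \le \theta\}, \]
where $I\{A\}$ is used to denote the indicator function of the set $A$. Then the joint density of $x_1, x_2, \ldots, x_n$ is
\[ f(x_1, x_2, \ldots, x_n \mid \theta) = \theta^{-n} I\{0 \le x_i \le \theta \ \forall\, i = 1, 2, \ldots, n\} \]
\[ = \theta^{-n} I\{T(x_1, x_2, \ldots, x_n) \le \theta\}\, I\{\min_i x_i \ge 0\} \]
\[ = g(T(x_1, x_2, \ldots, x_n) \mid \theta)\, h(x_1, x_2, \ldots, x_n), \]
where $T(x_1, x_2, \ldots, x_n) = \max_i x_i$ is the sufficient statistic for $\theta$.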
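Example 5: Let $X_1, X_2, \ldots, X_n$ be $n$ independent random observations from a normal distribution with mean $\mu$ and variance $\sigma^2 < \infty$, both unknown. Then the joint density function of $X_1, X_2, \ldots, X_n$ is
\[ f(x_1, x_2, \ldots, x_n \mid \mu, \sigma) = \frac{1}{(\sigma\sqrt{2\pi})^n} \exp\Big(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\Big) \]
\[ = \frac{1}{(\sigma\sqrt{2\pi})^n} \exp\Big(-\frac{1}{2\sigma^2}\sum_{i=1}^{n} x_i^2 + \frac{\mu}{\sigma^2}\sum_{i=1}^{n} x_i - \frac{n\mu^2}{2\sigma^2}\Big). \]
Here $\theta = (\mu, \sigma^2)$ is a vector valued parameter. Therefore, $T(X_1, X_2, \ldots, X_n) = (\sum_{i=1}^{n} X_i, \sum_{i=1}^{n} X_i^2)$ is a jointly sufficient statistic for $\mu$ and $\sigma^2$.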


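Example 6: Let $X_1, X_2, \ldots, X_n$ be $n$ independent random observations from $U(\theta, \theta + 1)$, where $\theta$ is unknown. Then the joint density function of $X_1, X_2, \ldots, X_n$ is
\[ f(x_1, x_2, \ldots, x_n \mid \theta) = I(\theta < x_1, x_2, \ldots, x_n < \theta + 1) \]
\[ = I(\theta < x_{(1)} \le x_{(2)} \le \ldots \le x_{(n)} < \theta + 1) \]
\[ = I(\theta < x_{(1)},\ x_{(n)} < \theta + 1). \]
Therefore, $T(X_1, X_2, \ldots, X_n) = (X_{(1)}, X_{(n)})$ is a jointly sufficient statistic for $\theta$.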




Exponential families 


The exponential family of distributions happens to be very rich when 
it comes to statistical modeling of data sets in practice. The distributions 
belonging to this family enjoy many interesting properties which often at- 
tract investigators toward specific members of this family in order to pursue 
statistical studies. In the present module we discuss briefly both the one- 
parameter and multi-parameter exponential family of distributions. Some of 
those properties and underlying data reduction principles, such as sufficiency 
or completeness, are discussed in subsequent modules. 


One-parameter exponential families 

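Definition Let $F_\theta = \{ f_\theta(x), x \in \mathcal{X}, \theta \in \Theta \}$ be a family of probability density or mass functions. The parameter space $\Theta$ is an open subinterval of the real line and the support $\mathcal{X} = \{ x : f_\theta(x) > 0 \}$ is independent of $\theta$. We say that it is a one-parameter exponential family if and only if we can express $f_\theta(x)$ as
\[ f_\theta(x) = \exp[q(\theta) T(x) + u(\theta) + v(x)], \quad (1) \]
for some real valued functions $q(\theta)$ and $u(\theta)$ of $\theta$, and $T(x)$ and $v(x)$ of $x$.

Let $X_1, X_2, \ldots, X_n$ be i.i.d. with density or mass function $f_\theta$. Then the joint density of $X_1, X_2, \ldots, X_n$ is given by
\[ g_\theta(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f_\theta(x_i) = \exp\Big[q(\theta)\sum_{i=1}^{n} T(x_i) + n\,u(\theta) + \sum_{i=1}^{n} v(x_i)\Big]. \]
Hence $g_\theta$ is also of the form (1), with $\sum_{i=1}^{n} T(x_i)$, $n\,u(\theta)$ and $\sum_{i=1}^{n} v(x_i)$ in place of $T(x)$, $u(\theta)$ and $v(x)$.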


Some examples: 


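Example 1: Let $X \sim N(0, \sigma^2)$, where the variance $\sigma^2$ is unknown. Then
\[ f_{\sigma^2}(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{x^2}{2\sigma^2}\Big) = \exp\Big(-\frac{\log(2\pi\sigma^2)}{2} - \frac{x^2}{2\sigma^2}\Big) \]
is a one-parameter exponential family with
\[ q(\sigma^2) = -\frac{1}{2\sigma^2}, \quad T(x) = x^2, \quad u(\sigma^2) = -\frac{\log(2\pi\sigma^2)}{2}, \quad v(x) = 0. \]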


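Example 2: Let $X \sim N(\mu, 1)$, where the mean $\mu$ is unknown. Then
\[ f_\mu(x) = \frac{1}{\sqrt{2\pi}} \exp\Big(-\frac{(x - \mu)^2}{2}\Big) = \exp\Big(-\frac{\log(2\pi)}{2} - \frac{x^2}{2} + \mu x - \frac{\mu^2}{2}\Big) \]
is a one-parameter exponential family with
\[ q(\mu) = \mu, \quad T(x) = x, \quad u(\mu) = -\frac{\mu^2}{2}, \quad v(x) = -\frac{x^2}{2} - \frac{\log(2\pi)}{2}. \]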


Example 3: Let $X \sim P(\lambda)$, where $\lambda > 0$ is unknown. Then the probability mass function
\[ P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!} = \exp(-\lambda + x\log(\lambda) - \log(x!)) \]
is a member of the one-parameter exponential family with
\[ q(\lambda) = \log(\lambda), \quad T(x) = x, \quad u(\lambda) = -\lambda, \quad v(x) = -\log(x!). \]
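The exponential-family decomposition of the Poisson p.m.f. can be verified numerically. The sketch below is an illustration with hypothetical function names; it takes $v(x) = -\log(x!)$, computed through `math.lgamma`, and compares $\exp[q(\lambda)T(x) + u(\lambda) + v(x)]$ with the direct formula.

```python
from math import exp, factorial, lgamma, log

def poisson_pmf(x, lam):
    """Direct formula e^{-lam} lam^x / x!."""
    return exp(-lam) * lam**x / factorial(x)

def exp_family_form(x, lam):
    """The same probability written as exp[q(lam) T(x) + u(lam) + v(x)]."""
    q = log(lam)          # q(lambda) = log(lambda)
    T = x                 # T(x) = x
    u = -lam              # u(lambda) = -lambda
    v = -lgamma(x + 1)    # v(x) = -log(x!)
    return exp(q * T + u + v)

for lam in (0.3, 2.5):
    for x in range(6):
        print(lam, x, poisson_pmf(x, lam), exp_family_form(x, lam))
```

The two columns should agree up to floating-point rounding for every $x$ and $\lambda$ tried.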


Example 4: Let $X \sim \mathrm{Uniform}(0, \theta)$. Then the probability density function is given by
\[ f_\theta(x) = \frac{1}{\theta}, \quad 0 < x < \theta. \]
Since the support of $f_\theta(x)$ depends on $\theta$, it does not belong to the one-parameter exponential family.


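Example 5: Let $X \sim N(\theta, \theta^2)$, $\theta > 0$. Then the probability density function is given by
\[ f_\theta(x) = \frac{1}{\theta\sqrt{2\pi}} \exp\Big(-\frac{(x - \theta)^2}{2\theta^2}\Big) = \exp\Big(-\frac{x^2}{2\theta^2} + \frac{x}{\theta} - \frac{1}{2} - \log(\theta\sqrt{2\pi})\Big). \]
Since the exponent involves both $x^2$ and $x$ with different functions of $\theta$ as coefficients, $f_\theta(x)$ cannot be expressed in the form (1), and it does not belong to the one-parameter exponential family.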


Multi-parameter exponential families 

Now we would like to introduce the multi-parameter exponential families with $k$ parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_k)$.
Definition Let $F_\theta = \{ f_\theta(x), x \in \mathcal{X}, \theta \in \Theta \}$ be a family of probability density or mass functions. The parameter space $\Theta$ is an open set of $\mathbb{R}^k$ and the support $\mathcal{X} = \{ x : f_\theta(x) > 0 \}$ is independent of $\theta$. We say that it is a k-parameter exponential family if and only if we can express $f_\theta(x)$ as
\[ f_\theta(x) = \exp\Big[\sum_{i=1}^{k} q_i(\theta) T_i(x) + u(\theta) + v(x)\Big], \quad (2) \]
for some real valued functions $q_i(\theta)$, $i = 1, 2, \ldots, k$, and $u(\theta)$ of $\theta$, and $T_i(x)$, $i = 1, 2, \ldots, k$, and $v(x)$ of $x$.


Some examples: 


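Example 1: Let $X \sim N(\mu, \sigma^2)$, where both $\mu$ and $\sigma^2$ are unknown. We have
\[ f_{\mu,\sigma^2}(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\Big(-\frac{x^2 - 2\mu x + \mu^2}{2\sigma^2}\Big) = \exp\Big[-\frac{x^2}{2\sigma^2} + \frac{\mu x}{\sigma^2} - \frac{1}{2}\Big(\frac{\mu^2}{\sigma^2} + \log(2\pi\sigma^2)\Big)\Big], \]
which is a two-parameter exponential family with
\[ q_1(\mu, \sigma^2) = -\frac{1}{2\sigma^2}, \quad q_2(\mu, \sigma^2) = \frac{\mu}{\sigma^2}, \quad T_1(x) = x^2, \quad T_2(x) = x, \]
\[ u(\mu, \sigma^2) = -\frac{1}{2}\Big(\frac{\mu^2}{\sigma^2} + \log(2\pi\sigma^2)\Big), \quad v(x) = 0. \]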


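Example 2: Let $X \sim \mathrm{Beta}(\theta_1, \theta_2)$ with both $\theta_1$ and $\theta_2$ unknown. We have
\[ f_{\theta_1,\theta_2}(x) = \frac{\Gamma(\theta_1 + \theta_2)}{\Gamma(\theta_1)\Gamma(\theta_2)} x^{\theta_1 - 1}(1 - x)^{\theta_2 - 1}, \text{ if } 0 < x < 1, \ \theta_1 > 0, \ \theta_2 > 0. \]
Then
\[ f_{\theta_1,\theta_2}(x) = \frac{\Gamma(\theta_1 + \theta_2)}{\Gamma(\theta_1)\Gamma(\theta_2)} \exp[(\theta_1 - 1)\log(x) + (\theta_2 - 1)\log(1 - x)]\, I(0 < x < 1), \]
where $I(A)$ is the indicator function for the set $A$. Therefore, $\mathrm{Beta}(\theta_1, \theta_2)$ belongs to a two-parameter exponential family with
\[ q_1(\theta_1, \theta_2) = (\theta_1 - 1), \quad q_2(\theta_1, \theta_2) = (\theta_2 - 1), \quad T_1(x) = \log(x), \quad T_2(x) = \log(1 - x), \]
\[ u(\theta_1, \theta_2) = \log\frac{\Gamma(\theta_1 + \theta_2)}{\Gamma(\theta_1)\Gamma(\theta_2)}, \quad v(x) = 0, \quad \mathcal{X} = \{0 < x < 1\}. \]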


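Example 3 : Let $X \sim \mathrm{Rect}(\theta_1, \theta_2)$ with both $\theta_1$ and $\theta_2$ unknown. We have
\[ f_{\theta_1,\theta_2}(x) = \frac{1}{\theta_2 - \theta_1}, \quad \theta_1 < x < \theta_2. \]
Since the support of the distribution depends on $(\theta_1, \theta_2)$, the p.d.f. $f_{\theta_1,\theta_2}(x)$ does not belong to the two-parameter exponential family.

Example 4: Let $X \sim \mathrm{Exp}(\mu, \sigma)$ with both $\mu$ and $\sigma$ unknown. We have
\[ f_{\mu,\sigma}(x) = \frac{1}{\sigma} \exp\Big(-\frac{x - \mu}{\sigma}\Big), \quad x > \mu, \ -\infty < \mu < \infty, \ \sigma > 0. \]
Since the support of the distribution depends on $\mu$, the p.d.f. $f_{\mu,\sigma}(x)$ does not belong to the two-parameter exponential family.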


In all the above examples we consider a single random variable $X$. But the concept of exponential family can also be extended to the joint distribution of correlated random vectors of dimension $\ge 2$. For example,

(i) If $(X_1, X_2) \sim N_2(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$, then the joint distribution of $(X_1, X_2)$ belongs to the five-parameter exponential family.

(ii) If $(X_1, X_2, \ldots, X_p) \sim N_p(\mu, \Sigma)$, then the joint distribution of $(X_1, X_2, \ldots, X_p)$ belongs to the k-parameter exponential family, where $k = p + \frac{p(p + 1)}{2} = \frac{p(p + 3)}{2}$.

Now we consider the following very important result.


Result If the distribution of $X$ belongs to an exponential family and has the form
\[ f_\theta(x) = \exp[q(\theta) t(x) + u(\theta) + v(x)], \]
then the distribution of $T = t(X)$ also belongs to an exponential family.
Proof. In the discrete case the p.m.f. of $T$ is given by
\[ f^T_\theta(t) = \sum_{x : t(x) = t} f_\theta(x) \]
\[ = \sum_{x : t(x) = t} \exp[q(\theta) t(x) + u(\theta) + v(x)] \]
\[ = \exp[u(\theta)] \sum_{x : t(x) = t} \exp[q(\theta) t(x) + v(x)] \]
\[ = \exp[q(\theta) t + u(\theta)] \sum_{x : t(x) = t} \exp\{v(x)\} \]
\[ = \exp[q(\theta) t + u(\theta) + H(t)], \]
where $H(t) = \log \sum_{x : t(x) = t} \exp\{v(x)\}$. Hence by the definition of an exponential family, the distribution of $T$ also belongs to one.




Completeness 

Definition Consider a family of probability density or mass functions $F_\theta = \{ f_\theta(x) : \theta \in \Theta \}$. Consider any real valued function $\psi(x)$ having finite expectation. We say that the family is complete if
\[ E_\theta(\psi(X)) = 0 \quad \forall\, \theta \in \Theta \]
implies that
\[ P_\theta(\psi(X) = 0) = 1 \quad \forall\, \theta \in \Theta. \]
The family $F_\theta$ is boundedly complete if for any bounded function $\psi(X)$ of $X$,
\[ E_\theta(\psi(X)) = 0 \quad \forall\, \theta \in \Theta \]
implies that
\[ P_\theta(\psi(X) = 0) = 1 \quad \forall\, \theta \in \Theta. \]
Let $T$ be a statistic and $F^T_\theta = \{ f^T_\theta(t) : \theta \in \Theta \}$ be the family of probability distributions of $T$.

Definition The statistic $T(X)$ is said to be complete (boundedly complete) if the family of distributions of $T$ is complete (boundedly complete), i.e. for any real valued function $\psi(t)$ having finite expectation,
\[ E_\theta(\psi(T)) = 0 \quad \forall\, \theta \in \Theta \]
implies that
\[ P_\theta(\psi(T) = 0) = 1 \quad \forall\, \theta \in \Theta. \]
Implication A statistic $T$ is complete means that there exists no non-trivial function of $T$ which is an unbiased estimator of zero.

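Result 1 : If $T$ is complete and $T^*$ is a function of $T$, then $T^*$ is complete.
Proof : Let $T^* = \phi(T)$ and let $\psi(T^*)$ be any real valued function of $T^*$ such that
\[ E_\theta(\psi(T^*)) = 0 \quad \forall\, \theta \in \Theta \]
\[ \Rightarrow E_\theta(\psi(\phi(T))) = 0 \quad \forall\, \theta \in \Theta \]
\[ \Rightarrow \psi(\phi(T)) = 0 \text{ with probability one } \forall\, \theta \in \Theta, \text{ since } T \text{ is complete} \]
\[ \Rightarrow \psi(T^*) = 0 \text{ with probability one } \forall\, \theta \in \Theta, \text{ since } T^* = \phi(T). \]
Hence $T^*$ is complete.

Result 2 : If $T$ is complete then $T$ is boundedly complete.

Proof: If $E_\theta(\psi(T)) = 0 \ \forall\, \theta \in \Theta$ implies that $P_\theta(\psi(T) = 0) = 1 \ \forall\, \theta \in \Theta$ for any function $\psi(T)$, then it must also be true for the subclass of bounded functions.

Note : The converse of this result is not true.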


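Counter example : Let $X$ be a random variable with p.m.f.
\[ f_\theta(x) = \theta, \text{ if } x = -1 \]
\[ = (1 - \theta)^2\theta^x, \text{ if } x = 0, 1, 2, \ldots, \quad 0 < \theta < 1. \]
Let $\psi(X)$ be any function of $X$. Then
\[ E_\theta(\psi(X)) = 0 \text{ for all } \theta \]
\[ \Rightarrow \psi(-1)\theta + \sum_{x=0}^{\infty} \psi(x)(1 - \theta)^2\theta^x = 0 \text{ for all } \theta \]
\[ \Rightarrow \sum_{x=0}^{\infty} \psi(x)\theta^x = -\psi(-1)\theta(1 - \theta)^{-2} = -\sum_{x=0}^{\infty} \psi(-1)\, x\, \theta^x \text{ for all } \theta. \]
Comparing the coefficients of $\theta^x$, $x = 0, 1, 2, \ldots$ from both sides of the above identity we get
\[ \psi(x) = -x\psi(-1), \text{ for all } x. \]
If $\psi(-1) = c \ne 0$, then $\psi(x) = -cx$ for all $x$, i.e. $\psi(x)$ is non-zero with non-zero probability.

Hence the family of distributions $F_\theta = \{ f_\theta(x) : \theta \in \Theta \}$ is not complete.
Again, $\psi(x)$ is bounded if and only if $\psi(-1) = 0$, in which case $\psi(x) = 0$ for all $x = 0, 1, 2, \ldots$ Hence the family of distributions $F_\theta = \{ f_\theta(x) : \theta \in \Theta \}$ is boundedly complete.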
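The unbounded function found in this counter example can be checked numerically. The sketch below is an illustration only: the names are hypothetical and the infinite series is truncated after a fixed number of terms, which is an approximation. It evaluates the expectation of $\psi$ with $\psi(-1) = c$ and $\psi(x) = -cx$ for $x = 0, 1, 2, \ldots$

```python
def expectation(psi, theta, terms=2000):
    """E_theta[psi(X)] for f(-1) = theta and f(x) = (1 - theta)^2 theta^x, x = 0, 1, 2, ...

    The infinite series is truncated after `terms` terms.
    """
    total = psi(-1) * theta
    for x in range(terms):
        total += psi(x) * (1 - theta) ** 2 * theta**x
    return total

def psi(x, c=1.0):
    """psi(-1) = c and psi(x) = -c*x for x = 0, 1, 2, ..."""
    return c if x == -1 else -c * x

for theta in (0.1, 0.5, 0.9):
    print(theta, expectation(psi, theta))
```

The printed expectations should be zero up to rounding for each $\theta$, even though $\psi$ is not the zero function; $\psi$ is unbounded, which is why bounded completeness is not contradicted.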


Some examples: 


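Example 1: Let $X \sim \mathrm{Binomial}(n, \theta)$. Consider any real valued $\psi(x)$ such that
\[ E_\theta(\psi(X)) = 0 \quad \forall\, 0 < \theta < 1 \]
\[ \Rightarrow \sum_{x=0}^{n} \psi(x)\binom{n}{x}\theta^x(1 - \theta)^{n-x} = 0 \quad \forall\, 0 < \theta < 1. \]
If we divide by $(1 - \theta)^n$ and set $\lambda = \frac{\theta}{1 - \theta}$, then
\[ \sum_{x=0}^{n} \psi(x)\binom{n}{x}\lambda^x = 0 \quad \forall\, \lambda \in (0, \infty). \]
The left-hand side is a polynomial in $\lambda$ of degree $n$. The only way a polynomial of degree $n$ can be zero for more than $n$ distinct values of the variable is for all coefficients to be zero. That is,
\[ \psi(x) = 0 \text{ for } x = 0, 1, \ldots, n. \]
Thus, we have shown that if $E_\theta(\psi(X)) = 0 \ \forall\, \theta \in (0, 1)$, then $\psi(X) = 0$ with probability one. Hence the binomial family is complete.

Remark 1: If the parameter space $\Theta$ is not an open interval of the real line, for example if $\Theta = \{1/3, 2/3\}$, then the family of distributions need not be complete: the polynomial above is then required to vanish at only two values of $\lambda$, which does not force all its coefficients to be zero when $n \ge 2$.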


Remark 2: Let $X_1, X_2, \ldots, X_n$ be i.i.d. Bernoulli($\theta$) random variables. Consider the statistic $T(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} X_i \sim \mathrm{Binomial}(n, \theta)$. Then $T$ has a complete family of distributions and hence $T$ is complete.
The joint p.m.f. of $(X_1, X_2, \ldots, X_n)$ is given by
\[ f_\theta(x_1, x_2, \ldots, x_n) = \theta^{\sum_{i=1}^{n} x_i}(1 - \theta)^{n - \sum_{i=1}^{n} x_i}, \quad x_i = 0, 1 \text{ for all } i = 1, 2, \ldots, n. \]
The family of joint distributions of $(X_1, X_2, \ldots, X_n)$ is not complete, since if we take $\psi(X_1, X_2, \ldots, X_n) = X_1 - X_2$ then
\[ E_\theta(\psi(X_1, X_2, \ldots, X_n)) = E_\theta(X_1 - X_2) = 0 \quad \forall\, \theta \in (0, 1) \]
but $P_\theta(X_1 - X_2 \ne 0) > 0$.


Example 2: Let $X_1, X_2, \ldots, X_n$ be i.i.d. with common p.m.f.
\[ f_N(x) = \frac{1}{N}, \quad x = 1, 2, \ldots, N, \text{ where } N \text{ is a positive integer.} \]
Let $\psi(X)$ be any real valued function of $X$ such that
\[ E_N(\psi(X)) = 0 \text{ for all } N \ge 1 \]
\[ \Rightarrow \frac{1}{N}\sum_{x=1}^{N} \psi(x) = 0 \text{ for all } N \ge 1 \]
\[ \Rightarrow \sum_{x=1}^{N} \psi(x) = 0 \text{ for all } N \ge 1, \]
and this happens if and only if $\psi(x) = 0$ for all $x = 1, 2, \ldots$ Hence the family $F_N = \{ f_N(x) : N \ge 1 \}$ is complete.
Now suppose that we exclude the value $N = n_0$ for some fixed $n_0 \ge 1$ from the family $F_N = \{ f_N(x) : N \ge 1 \}$. Let us write $F^*_N = \{ f_N(x) : N \ge 1, N \ne n_0 \}$. The family $F^*_N$ is not complete. To show this we consider the function
\[ \psi(x) = 0, \text{ if } x = 1, 2, \ldots, n_0 - 1, n_0 + 2, n_0 + 3, \ldots \]
\[ = c, \text{ if } x = n_0 \]
\[ = -c, \text{ if } x = n_0 + 1, \]
where $c$ is a constant, $c \ne 0$.
Then $E_N(\psi(X)) = 0$ for all $N \ge 1$, $N \ne n_0$, but $P_N(\psi(X) = 0) \ne 1$ for $N > n_0$. Hence the family $F^*_N$ is not complete.
Note: This example shows that the exclusion of even one member from the family $F_N$ destroys completeness.
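The function used to show that the reduced family is not complete can be checked with exact arithmetic. A minimal sketch, assuming $n_0 = 4$ and $c = 1$ purely for illustration (the helper names are hypothetical):

```python
from fractions import Fraction

def expectation(psi, N):
    """E_N[psi(X)] for X uniform on {1, ..., N}."""
    return sum(Fraction(psi(x)) for x in range(1, N + 1)) / N

def make_psi(n0, c=1):
    """psi(n0) = c, psi(n0 + 1) = -c, and psi(x) = 0 for every other x."""
    def psi(x):
        if x == n0:
            return c
        if x == n0 + 1:
            return -c
        return 0
    return psi

psi = make_psi(n0=4)
for N in range(1, 10):
    print(N, expectation(psi, N))
```

The expectation should vanish for every $N$ except $N = n_0$, where it equals $c/n_0$; once that single member is removed from the family, a non-zero $\psi$ with zero expectation throughout exists.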
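Example 3 : Let $X \sim N(\theta, 1)$. Consider any real valued $\psi(x)$ such that
\[ E_\theta(\psi(X)) = 0 \quad \forall\, \theta \in \mathbb{R} \]
\[ \Rightarrow \int_{-\infty}^{\infty} \psi(x)\frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(x - \theta)^2} dx = 0 \quad \forall\, \theta \]
\[ \Rightarrow \int_{-\infty}^{\infty} \psi(x) e^{-\frac{x^2}{2}} e^{\theta x} dx = 0 \quad \forall\, \theta. \]
Since the integral is a bilateral Laplace transform of $\psi(x) e^{-x^2/2}$, we get
\[ \psi(x) e^{-\frac{x^2}{2}} = 0 \quad \forall\, x \]
\[ \Rightarrow \psi(x) = 0 \quad \forall\, x. \]
Hence the $N(\theta, 1)$ family is complete.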
Example 4 : Let $X \sim N(0, \theta^2)$. If we take $\psi(x) = x$ then
\[ E_\theta(\psi(X)) = 0 \quad \forall\, \theta > 0 \]
but the function $\psi(X) \ne 0$ with probability one. Hence the $N(0, \theta^2)$ family is not complete.

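Example 5: Let $X_1, X_2, \ldots, X_n$ be an i.i.d. random sample from the Normal distribution with mean 0 and variance $\sigma^2$ unknown. Here the statistic $T = \sum_{i=1}^{n} X_i^2 \sim \sigma^2\chi^2_n$ and its density function is given by
\[ f_\sigma(t) = \frac{1}{(2\sigma^2)^{n/2}\,\Gamma(n/2)} \exp\Big(-\frac{t}{2\sigma^2}\Big) t^{(n-2)/2}, \quad 0 < t < \infty. \]
Now for any measurable function $\psi(t)$ with $E_\sigma(\psi(T)) = 0$ for all $\sigma$,
\[ \frac{1}{(2\sigma^2)^{n/2}\,\Gamma(n/2)} \int_0^{\infty} \psi(t) \exp\Big(-\frac{t}{2\sigma^2}\Big) t^{(n-2)/2} dt = 0 \quad \forall\, \sigma \in (0, \infty), \]
and on using Laplace transforms it follows that $\psi(t) t^{(n-2)/2} = 0$, i.e. $\psi(t) = 0$ almost everywhere. Hence $T$ is a complete statistic for $\sigma^2$.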
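Example 6: Let $X_1, X_2, \ldots, X_n$ be $n$ i.i.d. random observations from a uniform distribution $U(0, \theta)$. The density function for each $X_i$ is
\[ f_\theta(x_i) = \frac{1}{\theta}, \text{ if } 0 < x_i < \theta \]
\[ = 0, \text{ otherwise.} \]
Here $T = X_{(n)}$, the largest order statistic, is a sufficient statistic. Moreover, we will show its completeness. Let us denote the probability density function of $T$ by $g_\theta(t)$; it is given by
\[ g_\theta(t) = \frac{n t^{n-1}}{\theta^n}, \text{ if } 0 < t < \theta \]
\[ = 0, \text{ otherwise.} \]
Now, for a measurable function $\psi(T)$,
\[ E_\theta(\psi(T)) = \frac{n}{\theta^n}\int_0^{\theta} \psi(t) t^{n-1} dt = 0 \quad \forall\, \theta \in (0, \infty) \]
\[ \Rightarrow \int_0^{\theta} \psi(t) t^{n-1} dt = 0 \quad \forall\, \theta \in (0, \infty). \]
Differentiating both sides with respect to $\theta$ gives $\psi(\theta)\theta^{n-1} = 0$, i.e. $\psi(\theta) = 0$, $\forall\, \theta \in (0, \infty)$, and hence $\psi(t) = 0$, $\forall\, t \in (0, \infty)$. Since $(0, \theta) \subset (0, \infty)$, we get $\psi(t) = 0$, $\forall\, t \in (0, \theta)$. Hence $T = X_{(n)}$ is a complete statistic.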


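Example 7: Let $X_1, X_2, \ldots, X_n$ be an i.i.d. random sample from the Normal distribution with mean $\mu$ and variance $\sigma^2$, both unknown. We know that $T = (T_1, T_2)$, where
\[ T_1 = \sqrt{n}\,\bar{X} \text{ and } T_2 = \sum_{i=1}^{n}(X_i - \bar{X})^2, \]
is a sufficient statistic for $\theta = (\mu, \sigma^2)$. Now the joint density function of $(T_1, T_2)$ is
\[ f_\theta(t_1, t_2) = C \exp\Big[-\frac{(t_1 - \sqrt{n}\mu)^2}{2\sigma^2}\Big]\, t_2^{(n-3)/2} \exp\Big[-\frac{t_2}{2\sigma^2}\Big], \quad t_2 > 0, \]
where $C$ does not depend on $(t_1, t_2)$. Suppose now that the function $\psi(T)$ is such that
\[ \int_{-\infty}^{\infty}\int_0^{\infty} \psi(t_1, t_2) f_\theta(t_1, t_2)\, dt_2\, dt_1 = 0, \text{ for all } \theta \in \Theta, \]
i.e.
\[ \int_{-\infty}^{\infty}\int_0^{\infty} \psi(t_1, t_2) \exp\Big[-\frac{t_1^2 + t_2}{2\sigma^2} + \frac{\sqrt{n}\mu\, t_1}{\sigma^2}\Big] t_2^{(n-3)/2}\, dt_2\, dt_1 = 0, \text{ for all } \theta \in \Theta. \]
Changing the variable $t_2$ to $u = t_1^2 + t_2$, this reduces to
\[ \int_{-\infty}^{\infty}\int_{t_1^2}^{\infty} \psi^*(t_1, u) \exp\Big[-\frac{u}{2\sigma^2} + \frac{\sqrt{n}\mu\, t_1}{\sigma^2}\Big] \xi(t_1, u)\, du\, dt_1 = 0, \text{ for all } \theta \in \Theta, \]
where $\psi^*(t_1, u) = \psi(t_1, u - t_1^2)$ and $\xi(t_1, u) = (u - t_1^2)^{(n-3)/2} \ne 0$ for $u > t_1^2$.

If we now extend the range of $\xi(t_1, u)$ by taking $\xi(t_1, u) = 0$ for $0 < u < t_1^2$, then the above identity becomes
\[ \int_{-\infty}^{\infty}\int_0^{\infty} \psi^*(t_1, u) \exp\Big[-\frac{u}{2\sigma^2} + \frac{\sqrt{n}\mu\, t_1}{\sigma^2}\Big] \xi(t_1, u)\, du\, dt_1 = 0, \text{ for all } \theta \in \Theta. \]
Let us write $\frac{\sqrt{n}\mu}{\sigma^2} = \theta_1$ and $-\frac{1}{2\sigma^2} = \theta_2$, so that the above is seen to be a Laplace transform. From the uniqueness property of the Laplace transform, it follows that
\[ \psi^*(t_1, u)\,\xi(t_1, u) = 0 \text{ for almost all } (t_1, u), \ u > 0, \]
i.e.
\[ \psi(t_1, t_2) = 0 \text{ for almost all } (t_1, t_2), \ t_2 > 0. \]
Hence $T = (T_1, T_2)$ has a complete family of distributions.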


On the other hand, to show that a statistic $T$ is not a complete statistic for some parameter $\theta$, all we need is to find some measurable function $\psi^*(t)$ such that
\[ E_\theta(\psi^*(T)) = 0 \quad \forall\, \theta \in \Theta, \]
but $\psi^*(t) \ne 0$ on a set of positive probability. The following examples illustrate this approach.


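Example 8: Let $X_1, X_2, \ldots, X_n$ be an i.i.d. random sample from $N(\theta, \theta^2)$. Then $T = (\sum_{i=1}^{n} X_i, \sum_{i=1}^{n} X_i^2)$ is jointly sufficient for $\theta$. However $T$ is not complete since
\[ E_\theta\Big[2\Big(\sum_{i=1}^{n} X_i\Big)^2 - (n + 1)\sum_{i=1}^{n} X_i^2\Big] = 0 \text{ for all } \theta \]
but $\psi^*(T) = 2(\sum_{i=1}^{n} X_i)^2 - (n + 1)\sum_{i=1}^{n} X_i^2$ is not identically zero.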
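The zero expectation claimed in Example 8 follows from the first two moments of the $N(\theta, \theta^2)$ distribution: $E(\sum X_i)^2 = n\theta^2 + n^2\theta^2$ and $E(\sum X_i^2) = 2n\theta^2$. The sketch below, with a hypothetical function name, carries out this moment calculation in exact arithmetic.

```python
from fractions import Fraction

def expected_psi_star(n, theta):
    """E[2 (sum X_i)^2 - (n + 1) sum X_i^2] for X_i i.i.d. N(theta, theta^2)."""
    theta = Fraction(theta)
    mean_sum = n * theta                    # E(sum X_i)
    var_sum = n * theta**2                  # Var(sum X_i), by independence
    e_sum_sq = var_sum + mean_sum**2        # E[(sum X_i)^2] = Var + mean^2
    e_sq_sum = n * (theta**2 + theta**2)    # E(sum X_i^2) = n (Var(X_i) + E(X_i)^2)
    return 2 * e_sum_sq - (n + 1) * e_sq_sum

for n in (2, 5, 10):
    for theta in (Fraction(1, 2), 1, 3):
        print(n, theta, expected_psi_star(n, theta))
```

Every printed value should be exactly zero, while the statistic itself clearly takes non-zero values for most samples.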


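Example 9: Let $X_1, X_2, \ldots, X_n$ be an i.i.d. random sample from $U(\theta, \theta + 1)$. Then $T = (X_{(1)}, X_{(n)})$ is jointly sufficient for $\theta$. However $T$ is not complete since
\[ E_\theta\Big[X_{(n)} - X_{(1)} - \frac{n - 1}{n + 1}\Big] = 0 \text{ for all } \theta \]
but $\psi^*(T) = X_{(n)} - X_{(1)} - \frac{n - 1}{n + 1}$ is not identically zero.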


Minimal sufficiency 
Definition A sufficient statistic $T$ is said to be minimal if, among all sufficient statistics, it provides the greatest possible reduction of the data, that is, if for any other sufficient statistic $U$ there exists a function $H$ such that $T = H(U)$ almost everywhere.




Complete Sufficient Statistic 

Definition A statistic $T$ is called complete sufficient for an unknown parameter $\theta$ if and only if $T$ is sufficient for $\theta$ and $T$ is complete.

Now we shall consider the problem of determining a complete sufficient statistic for an exponential family of distributions. In this connection let us consider the following results. First we consider the case of a one-parameter exponential family.

Result 1 : If the p.m.f. or p.d.f. of a random variable $X$ is of the form
\[ f_\theta(x) = \exp[q(\theta) t(x) + u(\theta) + v(x)], \]
then the statistic $T = t(X)$ is complete sufficient for $\theta$.
Proof : We can factorize $f_\theta(x)$ as
\[ f_\theta(x) = g_\theta(t) h(x), \]
where $g_\theta(t) = \exp[u(\theta)] \exp[q(\theta) t(x)]$ and $h(x) = \exp[v(x)]$.
Hence by the factorization theorem $T$ is sufficient for $\theta$.

To prove the completeness, without loss of generality we assume that $X$ is a continuous random variable on $\mathbb{R}$. Now the p.d.f. of $T$ can be written as
\[ f^T_\theta(t) = \exp[u(\theta)] \exp[q(\theta) t]\, h^*(t). \]
Let $\psi(T)$ be any real valued function of $T$ such that
\[ E_\theta(\psi(T)) = 0, \text{ for all } \theta \]
\[ \Rightarrow \exp[u(\theta)] \int \psi(t) \exp[q(\theta) t]\, h^*(t)\, dt = 0, \text{ for all } \theta. \]
Since the integral is of the Laplace transform type, it follows that
\[ \psi(t) h^*(t) = 0, \text{ for all } t \in \mathbb{R}. \]
Since $h^*(t) \ne 0$ for all $t$ in the support of $T$, we get
\[ \psi(T) = 0 \text{ with probability one.} \]
Hence $T$ is complete.
Note : If $X_1, X_2, \ldots, X_n$ is a random sample from the above population $f_\theta(x)$, then $\sum_{i=1}^{n} t(X_i)$ is complete sufficient for $\theta$.

Let us consider some examples.
Example 1: If $X_1, X_2, \ldots, X_n$ is a random sample from the $P(\theta)$ population, then $\sum_{i=1}^{n} X_i$ is complete sufficient for $\theta$.
Example 2: If $X_1, X_2, \ldots, X_n$ is a random sample from the $NB(r, \theta)$ population, where $r$ is known and $\theta \in (0, 1)$ is unknown, then $\sum_{i=1}^{n} X_i$ is complete sufficient for $\theta$.
Example 3 : If $X_1, X_2, \ldots, X_n$ is a random sample from the $\mathrm{Beta}(\theta, 1)$ population, then $\prod_{i=1}^{n} X_i$ (equivalently $\sum_{i=1}^{n} \log X_i$) is complete sufficient for $\theta$.

Example 4: If $X_1, X_2, \ldots, X_n$ is a random sample from the $N(\theta, 1)$ population, then $\sum_{i=1}^{n} X_i$ is complete sufficient for $\theta$.

Example 5: If $X_1, X_2, \ldots, X_n$ is a random sample from the $N(0, \sigma^2)$ population, then $\sum_{i=1}^{n} X_i^2$ is complete sufficient for $\sigma^2$.


In order to determine complete sufficient statistic for a k-parameter expo- 
nential family we consider the following result. 
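Result 2 : If the p.m.f. or p.d.f. of a random variable $X$ is of the form

\[ f_\theta(x) = \exp\left[ \sum_{i=1}^{k} q_i(\theta)\, t_i(x) + u(\theta) + v(x) \right], \]

then the statistic $(T_1, T_2, \ldots, T_k)$, with $T_i = t_i(X)$, is complete sufficient
for $(\theta_1, \theta_2, \ldots, \theta_k)$.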


Now we discuss some examples of determining complete sufficient statistic 
where the distribution does not belong to the exponential family. 


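Example 1: Let $X_1, X_2, \ldots, X_n$ be a random sample from the population
with p.m.f.

\[ f_N(x) = \frac{1}{N}, \quad x = 1, 2, \ldots, N, \]

where $N\ (\geq 1)$ is a positive integer.

The joint p.m.f. of $X_1, X_2, \ldots, X_n$ is given by

\[ f_N(x_1, x_2, \ldots, x_n) = \frac{1}{N^n} \quad \text{if } 1 \leq x_{(n)} \leq N. \]

Let us define a function

\[ \delta(a, b) = 1 \ \text{if } a \leq b, \qquad \delta(a, b) = 0 \ \text{if } a > b; \]

then $f_N(x_1, x_2, \ldots, x_n)$ can be written as

\[ f_N(x_1, x_2, \ldots, x_n) = \frac{1}{N^n}\, \delta(x_{(n)}, N). \]

If we write $t = x_{(n)}$, $g_N(t) = \frac{1}{N^n}\, \delta(t, N)$ and $h(x_1, x_2, \ldots, x_n) = 1$, then

\[ f_N(x_1, x_2, \ldots, x_n) = g_N(t)\, h(x_1, x_2, \ldots, x_n). \]

By the factorization theorem, $T = X_{(n)}$ is sufficient for $N$.
The p.m.f. of $T$ is given by

\[ P_N(T = t) = \frac{t^n - (t-1)^n}{N^n}, \quad t = 1, 2, \ldots, N. \]

Let $\psi(T)$ be any real valued function of $T$ such that

\[ E_N(\psi(T)) = 0 \ \text{for all } N \geq 1
   \;\Rightarrow\; \frac{1}{N^n} \sum_{t=1}^{N} \psi(t) \left[ t^n - (t-1)^n \right] = 0 \ \text{for all } N \geq 1. \quad (1) \]

Since (1) holds for all values of $N \geq 1$, setting $N = 1$ in (1) we get
$\psi(1) = 0$.
Setting $N = 2$ in (1) we get
$\psi(1) + \psi(2)(2^n - 1) = 0 \Rightarrow \psi(2) = 0$.
Proceeding in this way it can be shown that

\[ \psi(t) = 0 \ \text{for all } t = 1, 2, \ldots, N. \]

Hence $T$ is complete sufficient for $N$.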
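The two computational steps of this example can be checked on a small case. The sketch below is only an illustration under stated assumptions: the function names (`pmf_max`, `pmf_max_bruteforce`, `solve_psi`) are hypothetical, and the p.m.f. of the maximum is taken to be $[t^n - (t-1)^n]/N^n$. It compares that formula with a brute-force enumeration and then solves the triangular system $E_N(\psi(T)) = 0$, $N = 1, 2, \ldots$, step by step.

```python
from fractions import Fraction
from itertools import product

def pmf_max(t, N, n):
    """P_N(X_(n) = t) for n i.i.d. draws uniform on {1, ..., N} (assumed formula)."""
    return Fraction(t**n - (t - 1)**n, N**n)

def pmf_max_bruteforce(t, N, n):
    """Same probability, by enumerating all N**n equally likely samples."""
    hits = sum(1 for xs in product(range(1, N + 1), repeat=n) if max(xs) == t)
    return Fraction(hits, N**n)

def solve_psi(expectations, n):
    """Recover psi(1), ..., psi(M) from the values E_N[psi(T)], N = 1, ..., M.

    expectations[N-1] holds E_N[psi(T)].  Because P_N(T = t) = 0 for t > N,
    the system is triangular and psi(N) is obtained at step N.
    """
    psi = []
    for N, target in enumerate(expectations, start=1):
        partial = sum(psi[t - 1] * pmf_max(t, N, n) for t in range(1, N))
        psi.append((target - partial) / pmf_max(N, N, n))
    return psi

N, n = 4, 3
formula_matches = all(
    pmf_max(t, N, n) == pmf_max_bruteforce(t, N, n) for t in range(1, N + 1)
)
# If every expectation is zero, the only solution is psi = 0.
psi = solve_psi([Fraction(0)] * 6, n)
```

Exact rational arithmetic is used so that the conclusion $\psi \equiv 0$ is not blurred by rounding.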


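Example 2: Let $X_1, X_2, \ldots, X_n$ be a random sample from $R(\theta_1, \theta_2)$. So
the density of each $X_i$ is

\[ f_{\theta_1, \theta_2}(x) = \frac{1}{\theta_2 - \theta_1} \quad \text{if } \theta_1 < x < \theta_2. \]

Let $\theta = (\theta_1, \theta_2)$, $X_{(1)} = \min[X_1, X_2, \ldots, X_n]$ and $X_{(n)} = \max[X_1, X_2, \ldots, X_n]$.
The joint p.d.f. of $X_1, X_2, \ldots, X_n$ is given by

\[ f_\theta(x_1, x_2, \ldots, x_n) = \frac{1}{(\theta_2 - \theta_1)^n} \quad \text{if } \theta_1 < x_{(1)} \leq x_{(n)} < \theta_2. \]

If we define a function $\phi(a, b)$ such that

\[ \phi(a, b) = 1 \ \text{if } a < b, \qquad \phi(a, b) = 0 \ \text{if } a \geq b, \]

then $f_\theta(x_1, x_2, \ldots, x_n)$ can be written as

\[ f_\theta(x_1, x_2, \ldots, x_n) = \frac{1}{(\theta_2 - \theta_1)^n}\, \phi(\theta_1, x_{(1)})\, \phi(x_{(n)}, \theta_2). \]

If we write $t_1 = x_{(1)}$, $t_2 = x_{(n)}$, $g_\theta(t_1, t_2) = \frac{1}{(\theta_2 - \theta_1)^n}\, \phi(\theta_1, t_1)\, \phi(t_2, \theta_2)$ and
$h(x_1, x_2, \ldots, x_n) = 1$, then $f_\theta(x_1, x_2, \ldots, x_n)$ can be factorized as

\[ f_\theta(x_1, x_2, \ldots, x_n) = g_\theta(t_1, t_2)\, h(x_1, x_2, \ldots, x_n). \]

Then by the factorization theorem $T = (X_{(1)}, X_{(n)})$ is jointly sufficient for
$\theta = (\theta_1, \theta_2)$.

We shall show that $X_{(1)}$ and $X_{(n)}$ are jointly complete. The joint
density of $T = (X_{(1)}, X_{(n)})$ is

\[ f_{\theta_1, \theta_2}(t_1, t_2) = \frac{n(n-1)}{(\theta_2 - \theta_1)^n}\, (t_2 - t_1)^{n-2}, \quad \theta_1 < t_1 < t_2 < \theta_2. \]

Let $\psi(T)$ be such that

\[ E_\theta(\psi(T)) = \int_{\theta_1}^{\theta_2} \int_{\theta_1}^{t_2} \psi(t_1, t_2)\, f_\theta(t_1, t_2)\, dt_1\, dt_2 = 0 \ \text{for all } \theta \in \Theta, \]

i.e.,

\[ \int_{\theta_1}^{\theta_2} \int_{\theta_1}^{t_2} \psi(t_1, t_2)\, (t_2 - t_1)^{n-2}\, dt_1\, dt_2 = 0 \ \text{for all } \theta_1 < \theta_2. \]

Differentiating both sides with respect to $\theta_2$, by Leibnitz theorem we get

\[ \int_{\theta_1}^{\theta_2} \psi(t_1, \theta_2)\, (\theta_2 - t_1)^{n-2}\, dt_1 = 0 \ \text{for all } \theta_1 < \theta_2. \]

Again differentiating with respect to $\theta_1$, we get $\psi(\theta_1, \theta_2)(\theta_2 - \theta_1)^{n-2} = 0$ for
all $\theta_1 < \theta_2$. Hence $\psi(t_1, t_2) = 0$ for $t_1 < t_2$. So $(X_{(1)}, X_{(n)})$ is jointly complete
for $\theta = (\theta_1, \theta_2)$.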


Some remarks on complete sufficient statistic 
Remark 1 : A sufficient statistic may not be complete. 


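Example: Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from $N(\theta, \theta^2)$, $\theta > 0$.
Then by the factorization theorem, $T = \left( \sum_{i=1}^{n} X_i, \sum_{i=1}^{n} X_i^2 \right)$ is jointly sufficient
for $\theta$. If we write $T_1 = \sum_{i=1}^{n} X_i$ and $T_2 = \sum_{i=1}^{n} X_i^2$ then

\[ T_1 \sim N(n\theta, n\theta^2), \qquad E_\theta(T_1) = n\theta, \quad Var_\theta(T_1) = n\theta^2. \]

Hence

\[ E_\theta(T_1^2) = n\theta^2 + n^2\theta^2. \]

Also

\[ E_\theta(T_2) = \sum_{i=1}^{n} E_\theta(X_i^2) = \sum_{i=1}^{n} \left[ Var_\theta(X_i) + E_\theta^2(X_i) \right] = 2n\theta^2. \]

If we define a function of $(T_1, T_2)$ as $\psi(T_1, T_2) = 2T_1^2 - (n+1)T_2$ then

\[ E_\theta\left[ \psi(T_1, T_2) \right] = 0 \ \text{for all } \theta > 0. \]

But $\psi(T_1, T_2) \neq 0$ with probability one. Hence $T = (T_1, T_2)$ is not complete.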
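The moment calculation behind this counterexample can be checked exactly. The following is a minimal sketch, assuming the moments $E(T_1^2) = n\theta^2 + n^2\theta^2$ and $E(T_2) = 2n\theta^2$ for a sample from $N(\theta, \theta^2)$; the function name `expected_psi` is hypothetical.

```python
from fractions import Fraction

def expected_psi(theta, n):
    """E[2*T1^2 - (n+1)*T2] for a sample of size n from N(theta, theta^2)."""
    theta = Fraction(theta)
    e_t1_sq = n * theta**2 + (n * theta) ** 2  # Var(T1) + [E(T1)]^2
    e_t2 = n * (theta**2 + theta**2)           # n * [Var(X_i) + E(X_i)^2]
    return 2 * e_t1_sq - (n + 1) * e_t2

# The expectation vanishes for every theta and n tried, although the
# statistic 2*T1^2 - (n+1)*T2 itself is not identically zero.
values = [expected_psi(theta, n) for theta in (Fraction(1, 3), 1, 5) for n in (2, 7)]
```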
Remark 2 : A complete statistic may not be sufficient. 


Example : Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n\ (> 1)$ from a $P(\theta)$
population.

Since $X_1 \sim P(\theta)$, by Result 1, $X_1$ is a complete statistic, but $X_1$ is not
sufficient for $\theta$.




Ancillarity 

The concept of ancillarity of a statistic is perhaps the furthest away from
sufficiency. A sufficient statistic $T(X)$ preserves all the information about $\theta$ contained in
the data $X$. In contrast, an ancillary statistic $V(X)$ by itself provides no information
about the unknown parameter $\theta$, but such statistics can play useful roles in statistical
methodology. The term ancillary statistic (from the Latin ancilla, meaning handmaiden)
was introduced by R. A. Fisher (1925) in the context of maximum likelihood
estimation.

Definition A statistic $V(X)$ whose distribution does not depend on the parameter
$\theta$ is called ancillary for $\theta$.


Let us consider the following examples 


Example 1 Let $X_i \sim R(\theta, \theta + 1)$, $i = 1, 2, \ldots, n$ independently, where $-\infty < \theta < \infty$.
Let $X_{(1)} \leq X_{(2)} \leq \ldots \leq X_{(n)}$ be the order statistics from the sample. The
sample range $R = X_{(n)} - X_{(1)}$ is ancillary for $\theta$. To prove this let us define
$Z_i = X_i - \theta$, $i = 1, 2, \ldots, n$. Then $Z_i \sim R(0, 1)$, $i = 1, 2, \ldots, n$, and hence the distribution
of $Z_i$ is independent of $\theta$. If $Z_{(1)} \leq Z_{(2)} \leq \ldots \leq Z_{(n)}$ are the order statistics for the $Z_i$'s
then $R = X_{(n)} - X_{(1)} = Z_{(n)} - Z_{(1)}$. Since the distribution of the $Z_i$'s is independent of $\theta$,
the distribution of $R$ is also independent of $\theta$.


Example 2 Let $X_i \sim R(0, \theta)$, $i = 1, 2, \ldots, n$ independently, where $0 < \theta < \infty$. Let
$X_{(1)} \leq X_{(2)} \leq \ldots \leq X_{(n)}$ be the order statistics from the sample. Then the statistic
$V = \frac{X_{(1)}}{X_{(n)}}$ is ancillary for $\theta$. To prove this let us define $Z_i = \frac{X_i}{\theta}$, $i = 1, 2, \ldots, n$. Then
$Z_i \sim R(0, 1)$, $i = 1, 2, \ldots, n$, and hence the distribution of $Z_i$ is independent of $\theta$. If
$Z_{(1)} \leq Z_{(2)} \leq \ldots \leq Z_{(n)}$ are the order statistics for the $Z_i$'s then $V = \frac{X_{(1)}}{X_{(n)}} = \frac{Z_{(1)}}{Z_{(n)}}$. Since
the distribution of the $Z_i$'s is independent of $\theta$, the distribution of $V$ is also independent of $\theta$.


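Example 3 Let $X_i \sim N(\theta, 1)$, $i = 1, 2, \ldots, n$ independently, where $-\infty < \theta < \infty$.
Then $\sum_{i=1}^{n} (X_i - \bar{X})^2$ and $\sum_{i=1}^{n} |X_i - \bar{X}|$ are ancillary statistics for $\theta$, where $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$.
To prove this let us define $Z_i = X_i - \theta$ and $W_i = X_i - \bar{X}$, $i = 1, 2, \ldots, n$. Then
$Z_i \sim N(0, 1)$, $i = 1, 2, \ldots, n$, and $\bar{Z} = \bar{X} - \theta \sim N(0, \frac{1}{n})$; both have distributions
independent of $\theta$. As $W_i = Z_i - \bar{Z}$, $i = 1, 2, \ldots, n$, the joint distribution of the $W_i$'s is also
independent of $\theta$. Hence $\sum_{i=1}^{n} W_i^2$ and $\sum_{i=1}^{n} |W_i|$ are ancillary statistics for $\theta$.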


Example 4 Let $X$ and $Y$ be independently distributed random variables such
that $X$ follows an exponential distribution with mean $\theta$ and $Y$ follows an exponential
distribution with mean $\frac{1}{\theta}$. Then the distributions of $Z = \frac{X}{\theta}$ and $W = \theta Y$ are independent of $\theta$. If we
define $V = XY = ZW$, the distribution of $V$ is independent of $\theta$. Thus $V$ is ancillary
for $\theta$.


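Example 5 Let $X$ denote the number of points obtained in throwing a biased
die and let the probability distribution of $X$ be given by

\[ P(X = 1) = \tfrac{1}{12}(1 - \theta), \quad P(X = 2) = \tfrac{1}{12}(2 - \theta), \quad P(X = 3) = \tfrac{1}{12}(3 - \theta), \]
\[ P(X = 4) = \tfrac{1}{12}(1 + \theta), \quad P(X = 5) = \tfrac{1}{12}(2 + \theta), \quad P(X = 6) = \tfrac{1}{12}(3 + \theta). \]

If we define a random variable $V$ such that

\[ V = 0 \ \text{if } X = 1 \text{ or } 4, \qquad V = 1 \ \text{if } X = 2 \text{ or } 5, \qquad V = 2 \ \text{if } X = 3 \text{ or } 6, \]

then $P(V = 0) = \frac{1}{6}$, $P(V = 1) = \frac{1}{3}$ and $P(V = 2) = \frac{1}{2}$, so
the distribution of $V$ is independent of $\theta$ and $V$ is ancillary for $\theta$.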
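The ancillarity claim for the die can be verified by direct enumeration. The sketch below assumes the probabilities are $(j - \theta)/12$ for faces $j = 1, 2, 3$ and $(j - 3 + \theta)/12$ for faces $j = 4, 5, 6$; the common factor $1/12$ is an assumption forced by normalisation, and the function names are hypothetical.

```python
from fractions import Fraction

def die_pmf(theta):
    """Assumed p.m.f. of the biased die: (j - theta)/12 and (j - 3 + theta)/12."""
    theta = Fraction(theta)
    low = {j: (j - theta) / 12 for j in (1, 2, 3)}
    high = {j: (j - 3 + theta) / 12 for j in (4, 5, 6)}
    return {**low, **high}

def pmf_of_v(theta):
    """Distribution of V, where V = 0, 1, 2 for X in {1,4}, {2,5}, {3,6}."""
    out = {0: Fraction(0), 1: Fraction(0), 2: Fraction(0)}
    for x, p in die_pmf(theta).items():
        out[(x - 1) % 3] += p
    return out

# The same distribution of V is obtained whatever value theta takes.
v_at_0 = pmf_of_v(0)
v_at_half = pmf_of_v(Fraction(1, 2))
v_at_1 = pmf_of_v(1)
```

The $\theta$ terms cancel within each pair of faces, which is exactly why $V$ carries no information about $\theta$.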


Example 6 Let $X$ be a random variable with p.d.f.

\[ f_\theta(x) = 1 \quad \text{if } \theta < x < \theta + 1, \ \theta > 0. \]

If we define $V = X - [X]$, where $[X]$ denotes the greatest integer contained in $X$, then
$V$ has the $R(0, 1)$ distribution. Since the distribution of $V$ is independent of $\theta$, the
statistic $V$ is ancillary for $\theta$.


Next we consider a very nice theorem which was due to Basu (1955). 

Basu's Theorem 

If $T$ is complete sufficient for $\theta$ and $V$ is ancillary for $\theta$, then $V$ is independent of $T$.

Proof For simplicity, we prove the theorem only in the discrete case. Suppose that
the domain spaces of $V$ and $T$ are denoted by $\mathcal{V}$ and $\mathcal{T}$ respectively.
For any $v \in \mathcal{V}$, we write $h(v) = P_\theta(V = v)$. Obviously $h(v)$ is free from $\theta$ since $V$
is ancillary for $\theta$.

Since $T$ is sufficient for $\theta$, the conditional distribution of $V$ given $T = t$ is independent
of $\theta$ for all $t \in \mathcal{T}$.

Let us write $g(t) = P(V = v \mid T = t)$, $t \in \mathcal{T}$. Then

\[ E_\theta(g(T)) = \sum_{t \in \mathcal{T}} P(V = v \mid T = t)\, P_\theta(T = t)
   = \sum_{t \in \mathcal{T}} P_\theta(V = v, T = t)
   = P_\theta(V = v)
   = h(v). \]

It follows that $E_\theta(g(T) - h(v)) = 0$ for all $\theta \in \Theta$.

Since $T$ is complete we get

\[ g(t) = h(v) \ \text{for all } v \in \mathcal{V},\ t \in \mathcal{T}, \]

that is, $P(V = v \mid T = t) = P(V = v)$.

Hence $V$ and $T$ are independently distributed.
Applications of Basu's theorem 
Basu's theorem can be used for proving the independence between two statistics. 


Some examples are given below. 


Application 1 In Example 2, $T = X_{(n)}$ is complete sufficient for $\theta$ and it is
distributed independently of $V = \frac{X_{(1)}}{X_{(n)}}$.
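A quick simulation gives an empirical feel for this application. This is a sketch only, under the assumption that the ancillary statistic is the ratio $X_{(1)}/X_{(n)}$ for a sample from $R(0, \theta)$; the helper names are hypothetical. A sample correlation near zero is consistent with independence but does not prove it.

```python
import random

def correlation(xs, ys):
    """Plain sample correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def simulate(theta, n, reps, seed=0):
    """Draw reps samples of size n from R(0, theta); return (T, V) values."""
    rng = random.Random(seed)
    t_vals, v_vals = [], []
    for _ in range(reps):
        xs = [rng.uniform(0, theta) for _ in range(n)]
        t_vals.append(max(xs))            # T = X_(n), complete sufficient
        v_vals.append(min(xs) / max(xs))  # V = X_(1) / X_(n), ancillary
    return t_vals, v_vals

t_vals, v_vals = simulate(theta=3.0, n=5, reps=20000)
r = correlation(t_vals, v_vals)
```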


Application 2 




In this module we are going to discuss two important theorems in statistical
inference. The first theorem is on the minimal sufficiency and completeness
of a family of distributions and the second theorem is on the improvement
of an unbiased estimator through sufficiency.

Theorem I If a minimal sufficient statistic exists for a family $\mathcal{F}_\theta = \{ f_\theta(x) : \theta \in \Theta \}$,
then a complete sufficient statistic is minimal sufficient.

Proof Let $T^*$ be a minimal sufficient statistic and $T$ be a complete sufficient
statistic for a parameter $\theta$. We have to show that $T$ and $T^*$ are equivalent.
As $T^*$ is a minimal sufficient statistic it is a function of every other sufficient
statistic.

Hence $T^* = h(T)$, a function of $T$.

Let us now define $\phi(T) = T - E(T \mid T^*)$. Since $T^*$ is sufficient, $E(T \mid T^*)$ is
independent of $\theta$, and since $T^* = h(T)$, $\phi(T)$ is a function of $T$ only. Now

\[ E_\theta(\phi(T)) = E_\theta\left[ T - E(T \mid T^*) \right] = E_\theta(T) - E_\theta(T) = 0 \ \text{for all } \theta \]
\[ \Rightarrow\ \phi(T) = 0 \ \text{a.e. w.r.t. } \mathcal{F}_\theta, \ \text{as } T \text{ is complete} \]
\[ \Rightarrow\ T = E(T \mid T^*) = h^*(T^*) \ \text{a.e. w.r.t. } \mathcal{F}_\theta, \]

i.e. $T$ is a function of $T^*$ and hence $T$ and $T^*$ are equivalent.

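Remark The converse of the theorem may not be true, i.e. a minimal sufficient
statistic may not be complete.

Counter example Let $X$ be a random variable with p.m.f.

\[ f_\theta(x) = \theta \ \text{if } x = -1, \qquad f_\theta(x) = (1 - \theta)^2\, \theta^x \ \text{if } x = 0, 1, 2, \ldots, \quad 0 < \theta < 1. \]

Clearly $X$ is minimal sufficient for $\theta$. To check the completeness of the family
let us consider a function $\psi(X)$ of $X$ such that

\[ E_\theta(\psi(X)) = 0 \ \text{for all } \theta \]
\[ \Rightarrow\ \theta\, \psi(-1) + \sum_{x=0}^{\infty} \psi(x)\, (1 - \theta)^2\, \theta^x = 0 \ \text{for all } \theta \]
\[ \Rightarrow\ \sum_{x=0}^{\infty} \psi(x)\, \theta^x = -\psi(-1)\, \theta\, (1 - \theta)^{-2} = -\sum_{x=0}^{\infty} \psi(-1)\, x\, \theta^x \ \text{for all } \theta. \]

Comparing the coefficients of $\theta^x$, $x = 0, 1, 2, \ldots$, from both sides of the above
identity we get

\[ \psi(x) = -x\, \psi(-1) \ \text{for all } x. \]

If $\psi(-1) = c \neq 0$, then $\psi(x) = -cx$ for all $x$, i.e. $\psi(x)$ is non-zero with
non-zero probability.
Hence the family of distributions $\mathcal{F}_\theta = \{ f_\theta(x) : \theta \in \Theta \}$ is not complete.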
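The counterexample can be checked numerically. The sketch below assumes the p.m.f. $f_\theta(-1) = \theta$, $f_\theta(x) = (1-\theta)^2 \theta^x$ for $x = 0, 1, 2, \ldots$ and the function $\psi(-1) = c$, $\psi(x) = -cx$; the series is truncated at a hypothetical cut-off `x_max`, so the result is zero only up to rounding and truncation error.

```python
def expected_psi(theta, c, x_max=2000):
    """Truncated series for E_theta[psi(X)] with psi(-1) = c, psi(x) = -c*x."""
    total = theta * c            # contribution of the point x = -1
    weight = (1 - theta) ** 2    # P(X = 0)
    for x in range(x_max + 1):
        total += (-c * x) * weight
        weight *= theta          # P(X = x + 1) = theta * P(X = x)
    return total

def total_mass(theta, x_max=2000):
    """Truncated check that the assumed p.m.f. sums to one."""
    mass = theta
    weight = (1 - theta) ** 2
    for _ in range(x_max + 1):
        mass += weight
        weight *= theta
    return mass

results = [expected_psi(theta, c=2.0) for theta in (0.1, 0.5, 0.9)]
```

So a non-zero function of $X$ has zero expectation for every $\theta$, which is the failure of completeness described above.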


Theorem II Let the statistic $U = U(X_1, X_2, \ldots, X_n)$ be an unbiased
estimator of $g(\theta)$ while $T$ is a sufficient statistic for $\theta$. Consider the function
$\phi(T)$ of $T$ such that

\[ \phi(t) = E(U \mid T = t). \]

Then $\phi(T)$ is itself an unbiased estimator of $g(\theta)$; moreover,

\[ Var_\theta(\phi(T)) \leq Var_\theta(U). \]


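Proof Since $T$ is sufficient for $\theta$, the conditional expectation of $U$ given $T = t$
cannot depend upon the unknown parameter $\theta$. Hence $\phi(t)$ is a function of
$t$ and it is free from $\theta$. In other words, $\phi(T)$ is a real valued statistic and so
we can call it an estimator. Using a theorem on conditional expectation (see
Rohatgi and Saleh) we get

\[ E_\theta(U) = E_\theta(E(U \mid T)) = E_\theta(\phi(T)), \]
\[ \Rightarrow\ E_\theta(\phi(T)) = g(\theta) \ \ \forall\, \theta \in \Theta. \]

Hence $\phi(T)$ is an unbiased estimator of $g(\theta)$. By a theorem on conditional
variance (see Rohatgi and Saleh) we get

\[ Var_\theta(U) = E_\theta[Var(U \mid T)] + Var_\theta[E(U \mid T)] = E_\theta[Var(U \mid T)] + Var_\theta[\phi(T)]. \]

Since $E_\theta[Var(U \mid T)] \geq 0$ we have

\[ Var_\theta(U) \geq Var_\theta[\phi(T)] \ \text{for all } \theta \in \Theta. \]

The equality sign holds if

\[ E_\theta[Var(U \mid T)] = 0 \ \text{for all } \theta \]
\[ \Rightarrow\ E_\theta\left[ E\{ (U - E(U \mid T))^2 \mid T \} \right] = 0 \ \text{for all } \theta \]
\[ \Rightarrow\ E_\theta\{ U - \phi(T) \}^2 = 0 \ \text{for all } \theta \]
\[ \Rightarrow\ P_\theta(U = \phi(T)) = 1 \ \text{for all } \theta. \]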


Note In the literature the above theorem is known as the Rao-Blackwell theorem.
The technique of improving upon an initial unbiased estimator of $g(\theta)$ is
customarily referred to as Rao-Blackwellization.
One attractive feature of the Rao-Blackwell theorem is that there is no need
to guess the functional form of the final unbiased estimator of $g(\theta)$.
Implication Since an unbiased estimator based on a sufficient statistic $T$
never has a larger variance than one which is not based on $T$, the
search for a UMVUE may be restricted to those unbiased estimators which
are based on the sufficient statistic $T$.

Some examples: 

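Example 1 Let $X_1, X_2, \ldots, X_n$ be a random sample from a $Bin(1, \theta)$
population with p.m.f.

\[ f_\theta(x) = \theta^x (1 - \theta)^{1-x}, \quad x = 0, 1. \]

If we consider the function $g(\theta) = \theta$ then $X_1$ is an unbiased estimator of $g(\theta)$.
Our object is to find the improved unbiased estimator for $g(\theta) = \theta$ based on
the sufficient statistic. Note that $T = \sum_{i=1}^{n} X_i$ is sufficient for $\theta$. Consider
the statistic $\phi(T)$ such that

\[ \phi(t) = E(X_1 \mid T = t) = P\left[ X_1 = 1 \,\Big|\, \sum_{i=1}^{n} X_i = t \right]. \]

Then by the definition of conditional probability we get

\[ \phi(t) = \frac{P\left[ X_1 = 1, \sum_{i=1}^{n} X_i = t \right]}{P\left[ \sum_{i=1}^{n} X_i = t \right]}
 = \frac{P\left[ X_1 = 1, \sum_{i=2}^{n} X_i = t - 1 \right]}{P\left[ \sum_{i=1}^{n} X_i = t \right]}
 = \frac{P[X_1 = 1]\, P\left[ \sum_{i=2}^{n} X_i = t - 1 \right]}{P\left[ \sum_{i=1}^{n} X_i = t \right]} \]
\[ = \frac{\theta \binom{n-1}{t-1} \theta^{t-1} (1 - \theta)^{n-t}}{\binom{n}{t} \theta^{t} (1 - \theta)^{n-t}} = \frac{t}{n}. \]

Therefore $\phi(T) = \frac{1}{n} \sum_{i=1}^{n} X_i = \bar{X}$ is the Rao-Blackwellized version of the initial
unbiased estimator $X_1$.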
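The conditional expectation in this example can also be obtained by brute force. The sketch below, with the hypothetical helper `conditional_mean_x1`, enumerates all binary samples of size $n$ and computes $E(X_1 \mid T = t)$ exactly; the answer should not depend on the value of $\theta$ used, in line with the sufficiency of $T$.

```python
from fractions import Fraction
from itertools import product

def conditional_mean_x1(n, t, theta):
    """E(X_1 | sum of X_i = t) for n i.i.d. Bin(1, theta) variables, 0 < theta < 1."""
    theta = Fraction(theta)
    num = den = Fraction(0)
    for xs in product((0, 1), repeat=n):
        if sum(xs) != t:
            continue
        p = theta ** sum(xs) * (1 - theta) ** (n - sum(xs))  # P(X = xs)
        num += xs[0] * p
        den += p
    return num / den

n = 5
table_a = {t: conditional_mean_x1(n, t, Fraction(3, 10)) for t in range(n + 1)}
table_b = {t: conditional_mean_x1(n, t, Fraction(7, 10)) for t in range(n + 1)}
```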


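Example 2 Let $X_1, X_2, \ldots, X_n$ be a random sample from a Poisson($\theta$)
population with p.m.f.

\[ f_\theta(x) = \frac{\exp(-\theta)\, \theta^x}{x!} \ \text{if } x = 0, 1, 2, \ldots, \qquad f_\theta(x) = 0 \ \text{otherwise.} \]

Consider the function

\[ g(\theta) = P[X_1 = k] = \frac{\exp(-\theta)\, \theta^k}{k!}. \]

Let us define a random variable $U$ such that

\[ U = 1 \ \text{if } X_1 = k, \qquad U = 0 \ \text{otherwise.} \]

Then

\[ E_\theta(U) = 1 \times P[X_1 = k] + 0 \times P[X_1 \neq k] = g(\theta) \ \text{for all } \theta > 0, \quad (1) \]

so that $U$ is an unbiased estimator of $g(\theta)$. Now we know that $T = \sum_{i=1}^{n} X_i$ is a sufficient
statistic for $\theta$. Consider then the new statistic $\phi(T)$ such that, for $t \geq k$,

\[ \phi(t) = E(U \mid T = t) = P[X_1 = k \mid T = t]
 = \frac{P\left[ X_1 = k, \sum_{i=2}^{n} X_i = t - k \right]}{P[T = t]} \]
\[ = \frac{\dfrac{\exp(-\theta)\, \theta^k}{k!} \cdot \dfrac{\exp(-(n-1)\theta)\, \{(n-1)\theta\}^{t-k}}{(t-k)!}}{\dfrac{\exp(-n\theta)\, (n\theta)^t}{t!}}
 = \binom{t}{k} \left( \frac{1}{n} \right)^{k} \left( 1 - \frac{1}{n} \right)^{t-k}, \]

since $T$ has the Poisson distribution with parameter $n\theta$ while $\sum_{i=2}^{n} X_i$ has
the Poisson distribution with parameter $(n-1)\theta$. Thus $\phi(T) = \binom{T}{k} \left( \frac{1}{n} \right)^{k} \left( 1 - \frac{1}{n} \right)^{T-k}$ is
an improved unbiased estimator of $g(\theta)$, with a variance no larger than that of the initial
unbiased estimator $U$ of $g(\theta)$.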
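The unbiasedness of the Rao-Blackwellized estimator can be confirmed numerically. The sketch below assumes $\phi(t) = \binom{t}{k} (1/n)^k (1 - 1/n)^{t-k}$ for $t \geq k$ (and $0$ otherwise) and $T \sim$ Poisson$(n\theta)$; the function names and the truncation point `t_max` are hypothetical choices.

```python
import math

def phi(t, k, n):
    """Assumed Rao-Blackwellized estimator of P[X_1 = k], as a function of t."""
    if t < k:
        return 0.0
    return math.comb(t, k) * (1 / n) ** k * (1 - 1 / n) ** (t - k)

def poisson_pmf_table(mean, t_max):
    """P(T = 0), ..., P(T = t_max) for T ~ Poisson(mean), built recursively."""
    probs = [math.exp(-mean)]
    for t in range(1, t_max + 1):
        probs.append(probs[-1] * mean / t)
    return probs

def expected_phi(theta, k, n, t_max=200):
    """Truncated series for E_theta[phi(T)] with T ~ Poisson(n * theta)."""
    probs = poisson_pmf_table(n * theta, t_max)
    return sum(phi(t, k, n) * p for t, p in enumerate(probs))

def g(theta, k):
    """Target parametric function g(theta) = P[X_1 = k]."""
    return math.exp(-theta) * theta ** k / math.factorial(k)

gap = abs(expected_phi(1.7, 2, 4) - g(1.7, 2))
```

The recursive construction of the Poisson probabilities avoids dividing by very large factorials.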
Example 3 Let $X_1, X_2, \ldots, X_n$ be a random sample from $N(\theta, 1)$ with
p.d.f.

\[ f_\theta(x) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2}(x - \theta)^2 \right), \quad -\infty < x < \infty. \]

If we consider the function $g(\theta) = \theta$ then $X_1$ is an unbiased estimator of $g(\theta)$.
Our object is to find the improved unbiased estimator for $g(\theta) = \theta$ based on
the sufficient statistic. Note that $T = \frac{1}{n} \sum_{i=1}^{n} X_i$ is sufficient for $\theta$. Consider
the statistic $\phi(T)$ such that

\[ \phi(t) = E(X_1 \mid T = t). \]

Since the conditional distribution of $X_1$ given $T = t$ is $N(t, 1 - \frac{1}{n})$ it follows
that $E(X_1 \mid T = t) = t$. Hence the Rao-Blackwellized version of $X_1$ is $T = \bar{X}$.
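Example 4 Let $X_1, X_2, \ldots, X_n$ be a random sample from a $Rect(0, \theta)$
distribution with p.d.f.

\[ f_\theta(x) = \frac{1}{\theta} \ \text{if } 0 < x < \theta, \qquad f_\theta(x) = 0 \ \text{otherwise.} \]

Consider the function

\[ g(\theta) = \frac{\theta}{2}. \]

Since $E_\theta(X_1) = \frac{\theta}{2}$ for all $\theta > 0$, we consider $U = X_1$ as an initial unbiased
estimator of $g(\theta)$. Here the statistic $T = \max(X_1, X_2, \ldots, X_n)$ is sufficient
for $\theta$. To improve our initial estimator let us define the statistic

\[ \phi(T) = E(U \mid T). \]

To find $\phi(T)$ note that if $T = t$, then $X_1 = t$ with probability $\frac{1}{n}$, and $X_1$ is
uniformly distributed over $(0, t)$ with the remaining probability $\left( 1 - \frac{1}{n} \right)$. Hence

\[ E(X_1 \mid T = t) = t \cdot \frac{1}{n} + \frac{t}{2} \left( 1 - \frac{1}{n} \right) = \frac{n+1}{n} \cdot \frac{t}{2}. \]

Thus $\phi(T) = \frac{n+1}{2n}\, T$ is the Rao-Blackwellized version of the initial unbiased
estimator $U$.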


Subject-Statistics 
Paper- Determination of UMVUE by Complete Sufficient Statistic
Module- 14


If the Rao-Blackwellized version in the end always comes up with the same
refined estimator regardless of which unbiased estimator of $g(\theta)$ one initially
starts with, then of course one has found the UMVUE for $g(\theta)$. Lehmann and
Scheffe's (1950) notion of a complete statistic, introduced in Module-9, plays
a major role in this area. We first prove the following result from Lehmann
and Scheffe (1950).


Lehmann-Scheffe Theorem I Suppose that $U$ is an unbiased estimator
of a real valued parametric function $g(\theta)$ where the unknown parameter
$\theta \in \Theta \subset \mathbb{R}$. Let $T$ be a complete sufficient statistic for $\theta$. If we define a statistic
$\phi(T) = E(U \mid T)$, then $\phi(T)$ is the unique (i.e. w.p. 1) UMVUE of $g(\theta)$.

Note: In this theorem the parameter $\theta$ may be vector valued and so is the
complete sufficient statistic $T$.

Proof The Rao-Blackwell theorem assures us that in order to search for
the UMVUE of $g(\theta)$ we need only focus on unbiased estimators which
are functions of the sufficient statistic $T$ alone. We already know that $\phi(T)$ is a
function of $T$ and it is an unbiased estimator of $g(\theta)$. Suppose that there
is another unbiased estimator $\phi^*(T)$ of $g(\theta)$ where $\phi^*(T)$ is also a function of
$T$.

If we define $h(T) = \phi(T) - \phi^*(T)$, then

\[ E_\theta(h(T)) = g(\theta) - g(\theta) = 0 \ \text{for all } \theta. \quad (1) \]

Now we use the definition of the completeness of a statistic. Since $T$ is a
complete statistic, from (1) it follows that $h(T) = 0$ w.p. 1, that is we must
have $\phi(T) = \phi^*(T)$ w.p. 1. The result then follows.


Note : In order to find the UMVUE of $g(\theta)$, we need not always
go through the conditioning with respect to the complete sufficient statistic
$T$. In some problems, the following alternative and yet equivalent result may
be more directly applicable. We state the result without proving it. Its proof
can be easily constructed from the proof of Lehmann-Scheffe Theorem I.

Lehmann-Scheffe Theorem II Suppose that $T$ is a complete sufficient
statistic for $\theta \in \Theta$. If $\psi(T)$ is an unbiased estimator of a real valued
parametric function $g(\theta)$, then $\psi(T)$ is the unique (i.e. w.p. 1) UMVUE of $g(\theta)$.


Let us consider some examples. 


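Example 1 Let $X_1, X_2, \ldots, X_n$ be a random sample from a Poisson
distribution with probability mass function

\[ f_\theta(x) = \frac{\exp(-\theta)\, \theta^x}{x!} \ \text{if } x = 0, 1, \ldots, \qquad f_\theta(x) = 0 \ \text{otherwise.} \]

We are interested in finding the UMVUE of $g(\theta) = e^{-\theta} = P_\theta(X_1 = 0)$. We
know that $f_\theta(x)$ belongs to the exponential family. Therefore $T = \sum_{i=1}^{n} X_i$
is a complete sufficient statistic. Consider the following statistic,

\[ U = 1 \ \text{if } X_1 = 0, \qquad U = 0 \ \text{otherwise.} \]

Then we get $E_\theta[U] = P_\theta[X_1 = 0] = \exp(-\theta)$. Hence $U$ is an unbiased
estimator of $g(\theta)$. If we define $\phi(T) = E(U \mid T)$ then

\[ \phi(t) = P[X_1 = 0 \mid T = t]
 = \frac{P\left[ X_1 = 0, \sum_{i=1}^{n} X_i = t \right]}{P\left[ \sum_{i=1}^{n} X_i = t \right]}
 = \frac{P\left[ X_1 = 0, \sum_{i=2}^{n} X_i = t \right]}{P\left[ \sum_{i=1}^{n} X_i = t \right]}
 = \frac{P[X_1 = 0]\, P\left[ \sum_{i=2}^{n} X_i = t \right]}{P\left[ \sum_{i=1}^{n} X_i = t \right]} \]
\[ = \frac{\exp(-\theta)\, \exp(-(n-1)\theta)\, [(n-1)\theta]^t / t!}{\exp(-n\theta)\, [n\theta]^t / t!}
 = \left( \frac{n-1}{n} \right)^{t}. \]

Hence, by the Lehmann-Scheffe theorem, $\phi(T) = \left( \frac{n-1}{n} \right)^{\sum_{i=1}^{n} X_i}$ is the UMVUE of
$e^{-\theta}$ for $n > 1$.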


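Example 2: Let $X_1, X_2, \ldots, X_n$ be a random sample from $N(\mu, \sigma^2)$ where
$\mu \in \mathbb{R}$ and $\sigma \in \mathbb{R}^+$ are unknown. Suppose we want to find the UMVUE
of $\mu^2$. If we define $\theta = (\mu, \sigma^2)$ then $\theta \in \Theta = \mathbb{R} \times \mathbb{R}^+$. Here the
statistic $T = (\bar{X}, S^2)$ is complete sufficient for $\theta$, where $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$ and
$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$.

Since $\bar{X} \sim N(\mu, \frac{\sigma^2}{n})$, we have $E_\theta(\bar{X}^2) = \mu^2 + \frac{\sigma^2}{n}$.

As $\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$, we have $E_\theta(S^2) = \sigma^2$. Hence we get $E_\theta\left( \bar{X}^2 - \frac{S^2}{n} \right) = \mu^2$
for all $\theta$. Since $\bar{X}^2 - \frac{S^2}{n}$ is a UE of $\mu^2$ based on the complete sufficient
statistic $T$, by the Lehmann-Scheffe theorem $\bar{X}^2 - \frac{S^2}{n}$ is the UMVUE of $\mu^2$.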


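Example 3: Let $X_1, X_2, \ldots, X_n$ be a random sample from a Rectangular
distribution $R(0, \theta)$ where $0 < \theta < \infty$ is unknown. Our problem is to find
the UMVUE for $\theta^\alpha$ where $\alpha$ is a known positive real number.
The complete sufficient statistic for $\theta$ is $T = X_{(n)} = \max(X_1, X_2, \ldots, X_n)$.
The p.d.f. of $X_{(n)}$ is given by

\[ f_\theta(t) = \frac{n}{\theta^n}\, t^{n-1}, \quad 0 < t < \theta. \]

Now,

\[ E_\theta(T^\alpha) = \frac{n}{\theta^n} \int_0^\theta t^{\alpha + n - 1}\, dt = \frac{n}{n + \alpha}\, \theta^\alpha \ \text{for all } \theta > 0, \]
\[ \Rightarrow\ E_\theta\left( \frac{n + \alpha}{n}\, T^\alpha \right) = \theta^\alpha \ \text{for all } \theta > 0. \]

Hence $\frac{n + \alpha}{n}\, T^\alpha$ is a UE of $\theta^\alpha$ based on a complete sufficient statistic $T$.
Hence, by the Lehmann-Scheffe theorem, it is the UMVUE of $\theta^\alpha$.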
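The unbiasedness of $\frac{n+\alpha}{n} T^\alpha$ can be checked by numerical integration against the density of $X_{(n)}$. The sketch below assumes that density is $n t^{n-1}/\theta^n$ on $(0, \theta)$ and uses a simple midpoint rule; the function name and the number of steps are hypothetical choices, so the result is only approximate.

```python
def expected_estimator(theta, alpha, n, steps=20000):
    """Midpoint-rule approximation of E[(n + alpha)/n * T**alpha], T = X_(n)."""
    h = theta / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        density = n * t ** (n - 1) / theta ** n   # assumed p.d.f. of X_(n)
        total += (n + alpha) / n * t ** alpha * density * h
    return total

approx = expected_estimator(theta=2.0, alpha=1.5, n=4)
target = 2.0 ** 1.5   # theta ** alpha
```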


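Example 4: Let $X_1, X_2, \ldots, X_n$ be a random sample from a Rectangular
distribution $R(\theta_1, \theta_2)$ where $0 < \theta_1 < \theta_2 < \infty$ are unknown. Our problem is
to find the UMVUE for $\theta_2 - \theta_1$ and $\theta_2 + \theta_1$. Note that the joint density is
given by

\[ f_{\theta_1, \theta_2}(x_1, x_2, \ldots, x_n) = \left( \frac{1}{\theta_2 - \theta_1} \right)^{n} \quad \text{if } \theta_1 < x_1, x_2, \ldots, x_n < \theta_2. \]

We know that $T = (X_{(1)}, X_{(n)})$ is a jointly complete sufficient statistic for
$(\theta_1, \theta_2)$, where $X_{(1)}$ is the smallest order statistic and $X_{(n)}$ is the largest order
statistic.
The p.d.f. of $X_{(n)}$ is given by

\[ f_{\theta_1, \theta_2}(x_{(n)}) = \frac{n}{(\theta_2 - \theta_1)^n}\, (x_{(n)} - \theta_1)^{n-1} \quad \text{if } \theta_1 < x_{(n)} < \theta_2, \]

and hence

\[ E(X_{(n)} - \theta_1) = \int_{\theta_1}^{\theta_2} \frac{n (x_{(n)} - \theta_1)^{n}}{(\theta_2 - \theta_1)^n}\, dx_{(n)} = \frac{n}{n+1} (\theta_2 - \theta_1). \quad (1) \]

The p.d.f. of $X_{(1)}$ is given by

\[ f_{\theta_1, \theta_2}(x_{(1)}) = \frac{n}{(\theta_2 - \theta_1)^n}\, (\theta_2 - x_{(1)})^{n-1} \quad \text{if } \theta_1 < x_{(1)} < \theta_2, \]

and hence

\[ E(\theta_2 - X_{(1)}) = \int_{\theta_1}^{\theta_2} \frac{n (\theta_2 - x_{(1)})^{n}}{(\theta_2 - \theta_1)^n}\, dx_{(1)} = \frac{n}{n+1} (\theta_2 - \theta_1). \quad (2) \]

Therefore, adding (1) and (2) we get

\[ E\left( X_{(n)} - X_{(1)} + (\theta_2 - \theta_1) \right) = \frac{2n}{n+1} (\theta_2 - \theta_1) \]
\[ \Rightarrow\ E\left( X_{(n)} - X_{(1)} \right) = \frac{n-1}{n+1} (\theta_2 - \theta_1) \]
\[ \Rightarrow\ E\left[ \frac{n+1}{n-1} \left( X_{(n)} - X_{(1)} \right) \right] = \theta_2 - \theta_1. \]

Again, subtracting (2) from (1) we get

\[ E\left( X_{(n)} + X_{(1)} - (\theta_1 + \theta_2) \right) = 0
   \ \Rightarrow\ E\left( X_{(n)} + X_{(1)} \right) = \theta_1 + \theta_2. \]

Hence $\frac{n+1}{n-1} (X_{(n)} - X_{(1)})$ and $X_{(n)} + X_{(1)}$ are unbiased estimators of
$\theta_2 - \theta_1$ and $\theta_2 + \theta_1$ based on the complete sufficient statistic $T = (X_{(1)}, X_{(n)})$.
Hence, by the Lehmann-Scheffe theorem, $\frac{n+1}{n-1} (X_{(n)} - X_{(1)})$ and $X_{(n)} + X_{(1)}$ are
the uniformly minimum variance unbiased estimators of $\theta_2 - \theta_1$ and $\theta_2 + \theta_1$
respectively.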


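Example 5: Let $X_1, X_2, \ldots, X_n$ be a random sample
from $N(\mu, \sigma^2)$, where $\mu \in \mathbb{R}$ is unknown but $\sigma > 0$ is known. We wish to
find the UMVUE of $\tau_c = P[X_1 \leq c]$. The complete sufficient statistic for $\mu$
is $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$. We start with the statistic

\[ U = 1 \ \text{if } X_1 \leq c, \qquad U = 0 \ \text{otherwise.} \]

Then we get $E_\mu(U) = P[X_1 \leq c] = \tau_c$ for all $\mu$. Therefore $U$ is an unbiased
estimator of $\tau_c$. If we define $\phi(\bar{X}) = E(U \mid \bar{X})$, then

\[ \phi(\bar{X}) = P(X_1 \leq c \mid \bar{X}) = P(X_1 - \bar{X} \leq c - \bar{X} \mid \bar{X}). \]

Now $(X_1 - \bar{X}) \sim N\left( 0, \frac{n-1}{n} \sigma^2 \right)$, which is independent of $\mu$. Since $(X_1 - \bar{X})$ is
ancillary for $\mu$ and $\bar{X}$ is complete sufficient for $\mu$, by Basu's theorem they are
independently distributed. Hence

\[ \phi(\bar{X}) = P(X_1 - \bar{X} \leq c - \bar{X}) = \Phi\left( \sqrt{\frac{n}{n-1}}\; \frac{c - \bar{X}}{\sigma} \right), \]

where $\Phi(\cdot)$ is the distribution function of the standard normal distribution.
Therefore $\phi(\bar{X})$ is a UE of $\tau_c$ based on the complete sufficient statistic $\bar{X}$. Hence by
the Lehmann-Scheffe theorem, $\Phi\left( \sqrt{\frac{n}{n-1}}\; \frac{c - \bar{X}}{\sigma} \right)$ is the UMVUE of $\tau_c$.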
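As a rough check of this example, the estimator can be averaged over simulated samples and compared with $P[X_1 \leq c]$. The sketch below assumes $\sigma$ is known and that the estimator is $\Phi\left( \sqrt{n/(n-1)}\,(c - \bar{X})/\sigma \right)$; the function names, seed and parameter values are hypothetical, and a Monte Carlo average only agrees with the target up to sampling error.

```python
import math
import random

def std_normal_cdf(z):
    """Distribution function of the standard normal, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def umvue_prob_below(xs, c, sigma):
    """Assumed UMVUE of P[X_1 <= c] for N(mu, sigma^2) data with sigma known."""
    n = len(xs)
    xbar = sum(xs) / n
    return std_normal_cdf(math.sqrt(n / (n - 1)) * (c - xbar) / sigma)

def monte_carlo_mean(mu, sigma, c, n, reps, seed=1):
    """Average of the estimator over reps simulated samples of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        total += umvue_prob_below(xs, c, sigma)
    return total / reps

estimate = monte_carlo_mean(mu=1.0, sigma=2.0, c=2.0, n=5, reps=20000)
target = std_normal_cdf((2.0 - 1.0) / 2.0)   # P[X_1 <= c]
```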



Subject-Statistics 
Paper- Cramer-Rao lower bound 


Cramer-Rao lower bound of the variance of an unbiased estimator
In this section we consider a very important inequality which provides a lower
bound for the variance of an unbiased estimator. Let $X = (X_1, X_2, \ldots, X_n)$
be a random sample from $f_\theta(\cdot)$, where $\theta$ belongs to $\Theta$. Let $T = t(X)$ be
an unbiased estimator of $g(\theta)$. We will consider the case where $f_\theta(\cdot)$ is a
probability density function; the development for discrete density functions
is analogous. Let $\mathcal{F}_\theta = \{ f_\theta(x) : \theta \in \Theta \}$ be the family of joint p.d.f.'s of
$X$, and suppose it satisfies the following Cramer-Rao regularity conditions:

1. $\Theta$ is an open subinterval of the real line.
2. The support $\mathcal{X} = \{ x : f_\theta(x) > 0 \}$ does not depend on $\theta$.
3. $\frac{\partial}{\partial\theta} \log f_\theta(x)$ exists for all $x$ and for all $\theta$.

4. For any statistic $h(X)$ with $E_\theta(|h(X)|) < \infty$ for all $\theta$, the
operations of integration and differentiation with respect to $\theta$ can be
interchanged in $E_\theta(h(X))$. That is,
$\frac{\partial}{\partial\theta} E_\theta(h(X)) = \frac{\partial}{\partial\theta} \int h(x)\, f_\theta(x)\, dx = \int h(x)\, \frac{\partial}{\partial\theta} f_\theta(x)\, dx$ for all $\theta$.

5. Fisher's information $I(\theta) = E_\theta\left[ \left( \frac{\partial}{\partial\theta} \log f_\theta(X) \right)^2 \right]$ exists and is positive
for all $\theta$.


Theorem Let $\mathcal{F}_\theta$ satisfy the above regularity conditions and let $T$ be a
UE of an estimable parametric function $g(\theta)$ such that $g'(\theta) = \frac{d}{d\theta} g(\theta)$ exists
for all $\theta$. If $Var_\theta(T)$ is finite then

\[ Var_\theta(T) \geq \frac{(g'(\theta))^2}{I(\theta)}. \]

Note: The quantity on the right-hand side of the above inequality is called the
Cramer-Rao lower bound (CRLB) for the variance of an unbiased estimator
of $g(\theta)$.


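Proof Let us define the score function $S(x, \theta) = \frac{\partial}{\partial\theta} \log f_\theta(x) = \frac{1}{f_\theta(x)} \frac{\partial}{\partial\theta} f_\theta(x)$.
Then we get

\[ E_\theta(S) = \int \frac{\partial}{\partial\theta} f_\theta(x)\, dx = \frac{\partial}{\partial\theta} \int f_\theta(x)\, dx = \frac{\partial}{\partial\theta}(1) = 0, \]
\[ Var_\theta(S) = E_\theta(S^2) = E_\theta\left[ \left( \frac{\partial}{\partial\theta} \log f_\theta(X) \right)^2 \right] = I(\theta), \]
\[ Cov_\theta(S, T) = E_\theta(TS) = \int t(x)\, \frac{\partial}{\partial\theta} f_\theta(x)\, dx = \frac{\partial}{\partial\theta} \int t(x)\, f_\theta(x)\, dx = \frac{\partial}{\partial\theta} E_\theta(T) = g'(\theta). \]

If $\rho_\theta(T, S)$ is the correlation between $T$ and $S$, then

\[ \rho_\theta^2(T, S) \leq 1
   \ \Rightarrow\ Cov_\theta^2(T, S) \leq Var_\theta(S)\, Var_\theta(T)
   \ \Rightarrow\ (g'(\theta))^2 \leq I(\theta)\, Var_\theta(T)
   \ \Rightarrow\ Var_\theta(T) \geq \frac{(g'(\theta))^2}{I(\theta)}. \]

The equality sign holds if and only if $\rho_\theta^2(T, S) = 1$, which implies that $T$ is a linear function
of $S$, i.e.

\[ T = a(\theta) + b(\theta)\, S \ \text{with probability 1,} \quad (1) \]

where $a(\theta)$ and $b(\theta)$ do not depend on $x$. Since $E_\theta(S) = 0$, taking
expectations in (1) we get $E_\theta(T) = a(\theta)$, which implies that $a(\theta) = g(\theta)$.
Taking the covariance of both sides of (1) with $S$ and using $Var_\theta(S) = I(\theta)$, we get
$g'(\theta) = b(\theta)\, I(\theta)$, i.e. $b(\theta) = \frac{g'(\theta)}{I(\theta)}$. Therefore for
equality to hold in the CRLB we must have

\[ T - g(\theta) = \frac{g'(\theta)}{I(\theta)}\, \frac{\partial}{\partial\theta} \log f_\theta(x) \ \text{with probability 1.} \quad (2) \]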


Note : If a family of distributions satisfies the Cramer-Rao regularity
conditions and if $T(X)$ is a UE of an estimable and differentiable parametric
function $g(\theta)$ such that (2) holds, then $T(X)$ is the UMVUE of $g(\theta)$.

Now we consider the following results. 

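Result 1 In the Cramer-Rao inequality, equality holds if and only if $f_\theta(x)$ is of
exponential form.

Proof. If part. Given that the family is of exponential form,

\[ f_\theta(x) = u(\theta)\, \exp[q(\theta)\, t(x)]\, v(x) = \exp[\log u(\theta) + q(\theta)\, t(x) + \log v(x)], \]

so that

\[ \log f_\theta(x) = \log u(\theta) + q(\theta)\, t(x) + \log v(x) = k(\theta) + q(\theta)\, t(x) + h(x), \]

where $k(\theta) = \log u(\theta)$ and $h(x) = \log v(x)$. Therefore

\[ \frac{\partial}{\partial\theta} \log f_\theta(x) = k'(\theta) + q'(\theta)\, t(x) \]
\[ \Rightarrow\ t(x) = \frac{1}{q'(\theta)}\, \frac{\partial}{\partial\theta} \log f_\theta(x) - \frac{k'(\theta)}{q'(\theta)} \]
\[ \Rightarrow\ t(x) - \left( -\frac{k'(\theta)}{q'(\theta)} \right) = \frac{1}{q'(\theta)}\, \frac{\partial}{\partial\theta} \log f_\theta(x), \]

which is of the form (2) with $T = t(X)$ and $g(\theta) = -\frac{k'(\theta)}{q'(\theta)}$,
and hence the equality condition for the Cramer-Rao lower bound is attained.
Only if part. Given that the equality condition in the Cramer-Rao lower bound
holds,

\[ T(x) - g(\theta) = \frac{g'(\theta)}{I(\theta)}\, \frac{\partial}{\partial\theta} \log f_\theta(x) \]
\[ \Rightarrow\ \frac{\partial}{\partial\theta} \log f_\theta(x) = \frac{I(\theta)}{g'(\theta)}\, T(x) - \frac{I(\theta)}{g'(\theta)}\, g(\theta). \]

Hence by solving the differential equation (integrating with respect to $\theta$),

\[ f_\theta(x) = u(\theta)\, \exp[q(\theta)\, t(x)]\, v(x). \]

So $f_\theta(x)$ is of the exponential form.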


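Result 2 If $T$ is a biased estimator of $g(\theta)$ with bias $B(\theta)$ such that $B'(\theta)$
exists, then

\[ MSE_\theta(T) \geq \frac{(g'(\theta) + B'(\theta))^2}{I(\theta)} + B^2(\theta). \]

Proof: Since $E_\theta(T) = B(\theta) + g(\theta)$ it follows that

\[ Cov_\theta(S, T) = \frac{\partial}{\partial\theta} E_\theta(T) = B'(\theta) + g'(\theta). \]

In Module 2 we have already seen that

\[ MSE_\theta(T) = Var_\theta(T) + B^2(\theta). \]

By the CRLB,

\[ Var_\theta(T) \geq \frac{(g'(\theta) + B'(\theta))^2}{I(\theta)}. \]

Hence we get

\[ MSE_\theta(T) \geq \frac{(g'(\theta) + B'(\theta))^2}{I(\theta)} + B^2(\theta). \]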


Some examples: 


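Example 1 Let $X_1, X_2, \ldots, X_n$ be a random sample from $f_\theta(x) = \theta \exp(-\theta x)$
for $x > 0$. Take $g(\theta) = \frac{1}{\theta}$. We are interested in finding that UE of
$\frac{1}{\theta}$ whose variance attains the CRLB. Note that $\frac{\partial}{\partial\theta} \log f_\theta(x) = \frac{n}{\theta} - \sum_{i=1}^{n} x_i = n\left( \frac{1}{\theta} - \bar{x} \right)$,
and from (2), $\bar{X}$ is the UE of $\frac{1}{\theta}$ whose variance attains the CRLB, i.e. $\bar{X}$
is the UMVUE of $\frac{1}{\theta}$.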


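Example 2 Let $X_1, X_2, \ldots, X_n$ be a random sample from $f_\lambda(x) = \exp(-\lambda)\lambda^x / x!$ for $x = 0, 1, 2, \ldots$. Let us find the lower bound for the variance
of an unbiased estimator of $\lambda$. For a single observation,

$$E_\lambda\left[\frac{\partial}{\partial\lambda} \log f_\lambda(X)\right]^2 = E_\lambda\left(\frac{X}{\lambda} - 1\right)^2 = \frac{1}{\lambda^2} \mathrm{Var}_\lambda(X) = \frac{1}{\lambda}.$$

Therefore the CRLB for the variance of an unbiased estimator of $\lambda$ based on $n$ observations is equal
to $\lambda / n$.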


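Example 3 Let $X_1, X_2, \ldots, X_n$ be a random sample from $N(\mu, \sigma^2)$ with
$\mu$ unknown and $\sigma$ known. Let us find the lower bound for the variance of an
unbiased estimator of $\mu$. Here,

$$f_\mu(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right).$$

So,

$$\frac{\partial}{\partial\mu} \log f_\mu(x) = -\frac{2(x - \mu)}{2\sigma^2}(-1) = \frac{x - \mu}{\sigma^2}.$$

Hence,

$$E_\mu\left[\frac{\partial}{\partial\mu} \log f_\mu(X)\right]^2 = E_\mu\left(\frac{X - \mu}{\sigma^2}\right)^2 = \frac{\sigma^2}{\sigma^4} = \frac{1}{\sigma^2}.$$

Therefore the lower bound for the variance of an unbiased estimator of $\mu$ based on $n$ observations is
$\sigma^2 / n$.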
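The Fisher information used in Example 2 can be checked numerically. The following is a minimal Python sketch, not part of the original notes: it approximates $E_\lambda(X/\lambda - 1)^2$ for a Poisson variable by a truncated series and converts it into the CRLB $\lambda/n$. The function names and the truncation point `terms` are illustrative assumptions.

```python
import math


def poisson_fisher_information(lam, terms=200):
    """Approximate E[(X/lam - 1)^2] for X ~ Poisson(lam) by a truncated sum."""
    total = 0.0
    pmf = math.exp(-lam)              # P(X = 0)
    for x in range(terms):
        total += pmf * (x / lam - 1.0) ** 2
        pmf *= lam / (x + 1)          # P(X = x + 1) = P(X = x) * lam / (x + 1)
    return total


def poisson_crlb(lam, n, terms=200):
    """CRLB for an unbiased estimator of lam from n i.i.d. observations."""
    return 1.0 / (n * poisson_fisher_information(lam, terms))


if __name__ == "__main__":
    lam, n = 2.5, 10
    print(poisson_fisher_information(lam))   # close to 1 / lam
    print(poisson_crlb(lam, n))              # close to lam / n
```

The truncation is harmless for moderate $\lambda$ because the Poisson tail decays faster than geometrically; for very large $\lambda$ the value of `terms` would have to be increased.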

Some remarks on CRLB 

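1. If the regularity conditions are violated then the variance of the
UMVUE may be less than the CRLB.

Example Let $X_1, X_2, \ldots, X_n$ be a random sample from $R(0, \theta)$. We know
that $X_{(n)}$, the largest order statistic, is a complete sufficient statistic for $\theta$.
The density of $X_{(n)}$ is given by

$$f_{X_{(n)}}(y) = \frac{n y^{n-1}}{\theta^n} \quad \text{if } 0 < y < \theta,$$

and $f_{X_{(n)}}(y) = 0$ otherwise.

By the Lehmann-Scheffe theorem, a function of $X_{(n)}$ which is an unbiased estimator
of $\theta$ is the UMVUE of $\theta$. Now $E_\theta(X_{(n)}) = \frac{n\theta}{n+1}$ and $E_\theta(X_{(n)}^2) = \frac{n\theta^2}{n+2}$.
So,

$$\mathrm{Var}_\theta(X_{(n)}) = \frac{n\theta^2}{n+2} - \frac{n^2\theta^2}{(n+1)^2} = \frac{n\theta^2}{(n+1)^2(n+2)}.$$

By the Lehmann-Scheffe theorem, $T = \frac{n+1}{n} X_{(n)}$ is the UMVUE of $\theta$. Therefore

$$\mathrm{Var}_\theta(T) = \frac{\theta^2}{n(n+2)}.$$

The joint p.d.f. of $X = (X_1, X_2, \ldots, X_n)$ is given by

$$f_\theta(\mathbf{x}) = \frac{1}{\theta^n} \quad \text{if } 0 < x_i < \theta, \; i = 1, 2, \ldots, n.$$

Since $E_\theta\left(\frac{\partial}{\partial\theta} \log f_\theta(X)\right)^2 = \frac{n^2}{\theta^2}$, the CRLB for the variance of a UE of $\theta$ is
equal to $\frac{\theta^2}{n^2}$, which is greater than the variance of the UMVUE of $\theta$. This is so
because here the support of the density depends on the parameter $\theta$.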


2. The variance of the UMVUE may not attain CRLB. 

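Example Let $X_1, X_2, \ldots, X_n$ be a random sample from $N(\theta, 1)$. Suppose $g(\theta) = \theta^2$. The CRLB for the variance of an unbiased estimator of $\theta^2$ is equal to
$\frac{4\theta^2}{n}$. By the Lehmann-Scheffe theorem, $T = \bar{X}^2 - \frac{1}{n}$ is the UMVUE of $\theta^2$ and

$$\mathrm{Var}_\theta(T) = \frac{4\theta^2}{n} + \frac{2}{n^2}.$$

So the variance of the UMVUE may not attain the CRLB.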
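Both remarks can be verified with exact rational arithmetic. The sketch below is an illustration added here, with hypothetical function names: it recomputes the variance of the UMVUE for $R(0, \theta)$ from the moments $E(X_{(n)}^k) = n\theta^k/(n+k)$ and compares it with the formal bound $\theta^2/n^2$, and it computes the gap between $\mathrm{Var}_\theta(\bar{X}^2 - 1/n)$ and the CRLB in the $N(\theta, 1)$ example.

```python
from fractions import Fraction


def uniform_umvue_variance(n, theta=1):
    """Var of T = (n+1)/n * X_(n) for R(0, theta), via E[X_(n)^k] = n theta^k / (n + k)."""
    theta = Fraction(theta)
    m1 = Fraction(n, n + 1) * theta
    m2 = Fraction(n, n + 2) * theta ** 2
    scale = Fraction(n + 1, n)
    return scale ** 2 * (m2 - m1 ** 2)


def uniform_formal_crlb(n, theta=1):
    """Formal bound theta^2 / n^2 from E[(d/d theta log f)^2] = n^2 / theta^2."""
    return Fraction(theta) ** 2 / Fraction(n) ** 2


def normal_square_gap(n, theta=1):
    """Var(Xbar^2 - 1/n) minus the CRLB 4 theta^2 / n under N(theta, 1)."""
    theta = Fraction(theta)
    variance = 4 * theta ** 2 / n + Fraction(2, n * n)
    crlb = 4 * theta ** 2 / n
    return variance - crlb


if __name__ == "__main__":
    for n in (1, 2, 5, 10):
        print(n, uniform_umvue_variance(n), uniform_formal_crlb(n), normal_square_gap(n))
```

For every $n$ the first quantity is strictly smaller than the second, and the gap in the last column is $2/n^2 > 0$.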


Module 17


Bhattacharyya System of lower bound 

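In this module we shall discuss a generalization of the Cramer-Rao lower bound
which is known as the Bhattacharyya system of lower bounds. Let us consider
a family of distributions $\mathcal{F}_\theta = \{f_\theta(x);\ \theta \in \Theta\}$ which satisfies the following
Bhattacharyya regularity conditions:

1. $\Theta$ is an open interval of the real line.

2. The support $\mathcal{X} = \{x : f_\theta(x) > 0\}$ does not depend on $\theta$.

3. For some integer $K$, $\frac{\partial^i}{\partial\theta^i} f_\theta(x)$ exists for all $x$ and for all $\theta$, for all $i = 1, \ldots, K$.

4. For any statistic $h(X)$ with $E_\theta(|h(X)|) < \infty$ for all $\theta$,
$\frac{\partial^i}{\partial\theta^i} E_\theta(h(X)) = \frac{\partial^i}{\partial\theta^i} \int h(x) f_\theta(x)\,dx = \int h(x) \frac{\partial^i}{\partial\theta^i} f_\theta(x)\,dx$ for all $\theta$ and for all $i = 1, \ldots, K$.

5. The matrix $V_K = ((v_{ij}))$ exists for all $\theta$ and is positive definite, where
$v_{ij}(\theta) = E_\theta\left[\left(\frac{1}{f_\theta(x)} \frac{\partial^i}{\partial\theta^i} f_\theta(x)\right)\left(\frac{1}{f_\theta(x)} \frac{\partial^j}{\partial\theta^j} f_\theta(x)\right)\right]$ for all $\theta \in \Theta$.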


Theorem. Let the family of distributions $\mathcal{F}_\theta = \{f_\theta(x),\ \theta \in \Theta\}$ satisfy
the Bhattacharyya regularity conditions and let $g(\theta)$ be a real valued function of
$\theta$ such that $g(\theta)$ is $K$-times differentiable for some integer $K$. Let $T$ be an
unbiased estimator of $g(\theta)$ such that $\mathrm{Var}_\theta(T)$ is finite for all $\theta$. Then

$$\mathrm{Var}_\theta(T) \ge g' V_K^{-1} g \quad \text{for all } \theta,$$

where $g = (g^{(1)}(\theta), g^{(2)}(\theta), \ldots, g^{(K)}(\theta))'$ and $g^{(i)}(\theta) = \frac{\partial^i}{\partial\theta^i} g(\theta)$.

Note: For $K = 1$, the Bhattacharyya lower bound (and the regularity conditions)
reduces to the Cramer-Rao lower bound (and the corresponding regularity
conditions).

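Proof: Define $S_i(\theta, x) = \frac{1}{f_\theta(x)} \frac{\partial^i}{\partial\theta^i} f_\theta(x)$, $i = 1, \ldots, K$. Hence,

$$E_\theta(S_i) = \int \frac{1}{f_\theta(x)} \frac{\partial^i}{\partial\theta^i} f_\theta(x)\, f_\theta(x)\,dx = \frac{\partial^i}{\partial\theta^i} \int f_\theta(x)\,dx = 0.$$

Therefore $V(S_i) = v_{ii}$, $\mathrm{cov}(S_i, S_j) = v_{ij}$, and

$$\mathrm{cov}(S_i, T) = E(S_i T) = \int t(x) \frac{1}{f_\theta(x)} \frac{\partial^i}{\partial\theta^i} f_\theta(x)\, f_\theta(x)\,dx = \frac{\partial^i}{\partial\theta^i} \int t(x) f_\theta(x)\,dx = \frac{\partial^i}{\partial\theta^i} g(\theta) = g^{(i)}(\theta).$$

Define

$$\Sigma_{(K+1)\times(K+1)} = \text{dispersion matrix of } (T, S_1, \ldots, S_K) = \begin{pmatrix} \mathrm{Var}_\theta(T) & g' \\ g & V_K \end{pmatrix}.$$

Since $\Sigma$ is a dispersion matrix it is non-negative definite. Therefore $\det(\Sigma) = |V_K|\,\{\mathrm{Var}_\theta(T) - g' V_K^{-1} g\} \ge 0$, and since $|V_K| > 0$ we get $\mathrm{Var}_\theta(T) \ge g' V_K^{-1} g$.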


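Case of equality. Equality holds if $|\Sigma| = 0$, i.e. $\mathrm{Rank}(\Sigma) < K + 1$,
and hence $\mathrm{Rank}(\Sigma) = K$ since $\Sigma$ contains $V_K$, which is non-singular. Hence
there exists a non-zero vector $l$ such that

$$T - g(\theta) = l'S \quad \text{with probability one, where } S = (S_1, S_2, \ldots, S_K)'$$

$$\Rightarrow\; T - g(\theta) = g' V_K^{-1} S \quad \text{with probability one.}$$

Proof: Note that

$$\mathrm{Var}_\theta(l'S - g' V_K^{-1} S) = \mathrm{Var}_\theta(T - g(\theta) - g' V_K^{-1} S)$$
$$= \mathrm{Var}_\theta(T) + g' V_K^{-1}\, \mathrm{Disp}(S)\, V_K^{-1} g - 2\,\mathrm{Cov}(T, g' V_K^{-1} S)$$
$$= g' V_K^{-1} g + g' V_K^{-1} g - 2 g' V_K^{-1} g = 0.$$

Hence $l'S - g' V_K^{-1} S = E(l'S - g' V_K^{-1} S) = 0$ with probability one.
Thus we get

$$T - g(\theta) = l'S = g' V_K^{-1} S \quad \text{with probability one.}$$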


Now we consider a very important result relating to Bhattacharya system of 
lower bounds. 


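Result: Let the $n$-th lower bound be denoted by $\Delta_n = g_n' V_n^{-1} g_n$,
where $g_n = (g^{(1)}(\theta), g^{(2)}(\theta), \ldots, g^{(n)}(\theta))'$. The sequence of lower bounds $\{\Delta_n\}$
is a non-decreasing sequence, i.e. $\Delta_{n+1} \ge \Delta_n$. Observe that $\Delta_1$ is the Cramer-Rao
lower bound.

Proof Note that $\Delta_{n+1} = g_{n+1}' V_{n+1}^{-1} g_{n+1}$, and

$$g_{n+1} = (g^{(1)}(\theta), \ldots, g^{(n)}(\theta), g^{(n+1)}(\theta))'.$$

Suppose the vector $g_{n+1}$ and the matrix $V_{n+1}$ are partitioned as follows:

$$g_{n+1} = \begin{pmatrix} g_n \\ g^{(n+1)} \end{pmatrix}, \qquad V_{n+1} = \begin{pmatrix} V_n & u_{n+1} \\ u_{n+1}' & v_{n+1,n+1} \end{pmatrix}.$$

For any non-singular matrix $C$ we can write

$$\Delta_{n+1} = g_{n+1}' C' (C')^{-1} V_{n+1}^{-1} C^{-1} C g_{n+1} = (C g_{n+1})' (C V_{n+1} C')^{-1} (C g_{n+1}).$$

If we choose

$$C = \begin{pmatrix} I_n & 0 \\ -u_{n+1}' V_n^{-1} & 1 \end{pmatrix},$$

then we have $C g_{n+1} = (g_n',\ g^{(n+1)} - u_{n+1}' V_n^{-1} g_n)'$, and

$$C V_{n+1} C' = \begin{pmatrix} I_n & 0 \\ -u_{n+1}' V_n^{-1} & 1 \end{pmatrix} \begin{pmatrix} V_n & u_{n+1} \\ u_{n+1}' & v_{n+1,n+1} \end{pmatrix} \begin{pmatrix} I_n & -V_n^{-1} u_{n+1} \\ 0 & 1 \end{pmatrix}$$

$$= \begin{pmatrix} V_n & u_{n+1} \\ u_{n+1}' - u_{n+1}' & v_{n+1,n+1} - u_{n+1}' V_n^{-1} u_{n+1} \end{pmatrix} \begin{pmatrix} I_n & -V_n^{-1} u_{n+1} \\ 0 & 1 \end{pmatrix}$$

$$= \begin{pmatrix} V_n & u_{n+1} - u_{n+1} \\ 0 & v_{n+1,n+1} - u_{n+1}' V_n^{-1} u_{n+1} \end{pmatrix} = \begin{pmatrix} V_n & 0 \\ 0 & E_{n+1,n+1} \end{pmatrix},$$

where $E_{n+1,n+1} = v_{n+1,n+1} - u_{n+1}' V_n^{-1} u_{n+1}$. Since $V_{n+1}$ is a positive definite
matrix, $V_n$ is p.d. and $E_{n+1,n+1} > 0$. Therefore,

$$(C V_{n+1} C')^{-1} = \begin{pmatrix} V_n^{-1} & 0 \\ 0 & 1/E_{n+1,n+1} \end{pmatrix}.$$

So finally,

$$\Delta_{n+1} = (C g_{n+1})' (C V_{n+1} C')^{-1} (C g_{n+1}) = g_n' V_n^{-1} g_n + \frac{(g^{(n+1)} - u_{n+1}' V_n^{-1} g_n)^2}{E_{n+1,n+1}} \ge \Delta_n.$$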


Note: The implication of the result is that if there is no UE of $g(\theta)$ which
attains the $n$-th lower bound $\Delta_n$, then a sharper lower bound $\Delta_{n+1}$ can
be obtained and examined. If a UE attains the $n$-th lower bound $\Delta_n$,
then no further improvement can be made and hence $\Delta_n = \Delta_{n+1}$. However,
$\Delta_n = \Delta_{n+1}$ does not imply that there exists a UE of $g(\theta)$ which attains the
$n$-th lower bound. Consider the following example:

Example: Let $X_1, X_2, \ldots, X_n$ be an independent sample from $N(\theta, 1)$. In
the last module we have seen that there does not exist a UE of $\theta^2$ which
attains $\Delta_1$, i.e. the CRLB. We want to find a UE of $\theta^2$ which attains the
Bhattacharyya lower bound $\Delta_2$.

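The joint p.d.f. of $X$ is given by

$$f_\theta(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}} \exp\left[-\frac{1}{2} \sum_{i=1}^{n} (x_i - \theta)^2\right].$$

Hence,

$$S_1 = \frac{1}{f_\theta(\mathbf{x})} \frac{\partial}{\partial\theta} f_\theta(\mathbf{x}) = \sum_{i=1}^{n} (x_i - \theta) = n(\bar{x} - \theta)$$

and

$$S_2 = \frac{1}{f_\theta(\mathbf{x})} \frac{\partial^2}{\partial\theta^2} f_\theta(\mathbf{x}) = n^2(\bar{x} - \theta)^2 - n.$$

By definition we know $E(S_1) = E(S_2) = 0$. Therefore,

$$E(S_1^2) = E\left[n^2(\bar{X} - \theta)^2\right] = n^2 \cdot \frac{1}{n} = n,$$

$$E(S_2^2) = E\left[n^4(\bar{X} - \theta)^4 + n^2 - 2n^3(\bar{X} - \theta)^2\right] = n^4 \cdot \frac{3}{n^2} + n^2 - 2n^3 \cdot \frac{1}{n} = 2n^2,$$

and $E(S_1 S_2) = n^3 E\left[(\bar{X} - \theta)^3\right] - n^2 E(\bar{X} - \theta) = 0$. Therefore,

$$V_2 = \begin{pmatrix} n & 0 \\ 0 & 2n^2 \end{pmatrix}.$$

Since the parametric function under consideration is $g(\theta) = \theta^2$, we have
$g' = (2\theta, 2)$. Therefore the Bhattacharyya lower bound of order 2 for $\theta^2$ is given by

$$\Delta_2 = g' V_2^{-1} g = (2\theta, 2) \begin{pmatrix} \frac{1}{n} & 0 \\ 0 & \frac{1}{2n^2} \end{pmatrix} \begin{pmatrix} 2\theta \\ 2 \end{pmatrix} = \frac{4\theta^2}{n} + \frac{2}{n^2}.$$

The UE of $\theta^2$ which attains the Bhattacharyya lower bound of order 2 is given
by

$$T - \theta^2 = g' V_2^{-1} S = (2\theta, 2) \begin{pmatrix} \frac{1}{n} & 0 \\ 0 & \frac{1}{2n^2} \end{pmatrix} \begin{pmatrix} n(\bar{x} - \theta) \\ n^2(\bar{x} - \theta)^2 - n \end{pmatrix}$$

$$= \left(\frac{2\theta}{n}, \frac{1}{n^2}\right) \begin{pmatrix} n(\bar{x} - \theta) \\ n^2(\bar{x} - \theta)^2 - n \end{pmatrix} = 2\theta(\bar{x} - \theta) + (\bar{x} - \theta)^2 - \frac{1}{n} = \left(\bar{x}^2 - \frac{1}{n}\right) - \theta^2.$$

Hence $T = \bar{X}^2 - \frac{1}{n}$ is the unbiased estimator of $\theta^2$ which attains the Bhattacharyya
lower bound of order 2.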
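The arithmetic of this example can be reproduced exactly. The following Python sketch is an added illustration with hypothetical function names: it evaluates $\Delta_1$ and $\Delta_2$ from $V_2 = \mathrm{diag}(n, 2n^2)$ and $g = (2\theta, 2)'$ using rational numbers, and compares $\Delta_2$ with $\mathrm{Var}_\theta(\bar{X}^2 - 1/n) = 4\theta^2/n + 2/n^2$.

```python
from fractions import Fraction


def bhattacharyya_bounds_normal_mean_square(n, theta):
    """Return (Delta_1, Delta_2) for g(theta) = theta^2 under N(theta, 1), sample size n."""
    theta = Fraction(theta)
    g1, g2 = 2 * theta, Fraction(2)               # g'(theta) and g''(theta)
    v11, v22 = Fraction(n), Fraction(2 * n * n)   # V_2 = diag(n, 2 n^2)
    delta1 = g1 ** 2 / v11
    # V_2 is diagonal, so g' V_2^{-1} g is a sum of elementwise ratios.
    delta2 = g1 ** 2 / v11 + g2 ** 2 / v22
    return delta1, delta2


def variance_of_umvue(n, theta):
    """Var(Xbar^2 - 1/n) = 4 theta^2 / n + 2 / n^2 when Xbar ~ N(theta, 1/n)."""
    theta = Fraction(theta)
    return 4 * theta ** 2 / n + Fraction(2, n * n)


if __name__ == "__main__":
    d1, d2 = bhattacharyya_bounds_normal_mean_square(10, 3)
    print(d1, d2, variance_of_umvue(10, 3))
```

The second bound coincides with the variance of the UMVUE, while the first (the CRLB) is strictly smaller, in line with the non-decreasing property $\Delta_2 \ge \Delta_1$.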


Chapman-Robbins Lower Bound


In Module-16 we have discussed the Cramer-Rao lower bound to the variance of an
unbiased estimator of some parametric function. The difficulty in applying the
Cramer-Rao lower bound is that its underlying assumptions do not hold in all cases,
and therefore we cannot apply this bound in all situations. To overcome
this problem Chapman and Robbins (1951) provided a lower bound to the variance
of an unbiased estimator. The importance of this inequality is that it is derived
without making the various assumptions that were needed in the case of the Cramer-Rao
inequality. We shall suppose that the parameter space $\Theta$ is an open sub-interval of the
real line.


Chapman-Robbins Inequality Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample
from $f_\theta(\cdot)$, where $\theta$ belongs to $\Theta$. Let $T = t(X)$ be an unbiased estimator of $g(\theta)$. We
will consider the case where $f_\theta(\cdot)$ is a probability density function; the development
for discrete density functions is analogous. Let $\theta_0 \in \Theta$ be any fixed value of $\theta$, such
that for sufficiently small $h \ne 0$, $\theta_0 + h \in \Theta$ and $f_{\theta_0}(x) = 0 \Rightarrow f_{\theta_0+h}(x) = 0$. Then

$$\mathrm{Var}_{\theta_0}(T) \ge \sup_h \frac{\{g(\theta_0 + h) - g(\theta_0)\}^2}{E_{\theta_0}\left[\dfrac{f_{\theta_0+h}(X)}{f_{\theta_0}(X)} - 1\right]^2}. \qquad (1)$$

Note: The quantity on the right-hand side of the above inequality is called the
Chapman-Robbins lower bound for the variance of an unbiased estimator of $g(\theta)$.


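Proof Since $T$ is an unbiased estimator of $g(\theta)$ we have

$$E_\theta(T) = \int t f_\theta(x)\,dx = g(\theta) \quad \forall\, \theta. \qquad (2)$$

From (2) and using the fact that $\int \{f_{\theta_0+h}(x) - f_{\theta_0}(x)\}\,dx = 0$, we get

$$g(\theta_0 + h) - g(\theta_0) = \int \{t f_{\theta_0+h}(x) - t f_{\theta_0}(x)\}\,dx$$
$$= \int \{t - g(\theta_0)\}\{f_{\theta_0+h}(x) - f_{\theta_0}(x)\}\,dx$$
$$= \int \{t - g(\theta_0)\}\, \frac{f_{\theta_0+h}(x) - f_{\theta_0}(x)}{f_{\theta_0}(x)}\, f_{\theta_0}(x)\,dx.$$

This means

$$\mathrm{Cov}_{\theta_0}\left[T,\ \frac{f_{\theta_0+h}(X) - f_{\theta_0}(X)}{f_{\theta_0}(X)}\right] = g(\theta_0 + h) - g(\theta_0).$$

So, using the well known result that for any two random variables $U$ and $V$

$$\{\mathrm{Cov}(U, V)\}^2 \le \mathrm{Var}(U)\,\mathrm{Var}(V),$$

we see that

$$\{g(\theta_0 + h) - g(\theta_0)\}^2 \le \mathrm{Var}_{\theta_0}(T)\; \mathrm{Var}_{\theta_0}\left[\frac{f_{\theta_0+h}(X) - f_{\theta_0}(X)}{f_{\theta_0}(X)}\right]. \qquad (3)$$

Since $E_{\theta_0}\left[\dfrac{f_{\theta_0+h}(X) - f_{\theta_0}(X)}{f_{\theta_0}(X)}\right] = \int \dfrac{f_{\theta_0+h}(x) - f_{\theta_0}(x)}{f_{\theta_0}(x)}\, f_{\theta_0}(x)\,dx = 0$, from (3) we get

$$\mathrm{Var}_{\theta_0}(T) \ge \frac{\{g(\theta_0 + h) - g(\theta_0)\}^2}{E_{\theta_0}\left[\dfrac{f_{\theta_0+h}(X) - f_{\theta_0}(X)}{f_{\theta_0}(X)}\right]^2} = \frac{\{g(\theta_0 + h) - g(\theta_0)\}^2}{E_{\theta_0}\left[\dfrac{f_{\theta_0+h}(X)}{f_{\theta_0}(X)} - 1\right]^2}. \qquad (4)$$

Since (4) holds for all admissible values of $h$, we get

$$\mathrm{Var}_{\theta_0}(T) \ge \sup_h \frac{\{g(\theta_0 + h) - g(\theta_0)\}^2}{E_{\theta_0}\left[\dfrac{f_{\theta_0+h}(X)}{f_{\theta_0}(X)} - 1\right]^2}.$$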


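Remark 1 The denominator of the quantity on the R.H.S. of (1) can be written as

$$E_{\theta_0}\left[\frac{f_{\theta_0+h}(X)}{f_{\theta_0}(X)} - 1\right]^2 = E_{\theta_0}\left[\frac{f_{\theta_0+h}(X)}{f_{\theta_0}(X)}\right]^2 - 2 E_{\theta_0}\left[\frac{f_{\theta_0+h}(X)}{f_{\theta_0}(X)}\right] + 1$$
$$= E_{\theta_0}\left[\frac{f_{\theta_0+h}(X)}{f_{\theta_0}(X)}\right]^2 - 2 \int f_{\theta_0+h}(x)\,dx + 1 = E_{\theta_0}\left[\frac{f_{\theta_0+h}(X)}{f_{\theta_0}(X)}\right]^2 - 1.$$

So the inequality can be rewritten as

$$\mathrm{Var}_{\theta_0}(T) \ge \sup_h \frac{\{g(\theta_0 + h) - g(\theta_0)\}^2}{E_{\theta_0}\left[\dfrac{f_{\theta_0+h}(X)}{f_{\theta_0}(X)}\right]^2 - 1}.$$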


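Remark 2 If we assume that $g(\theta)$ possesses a derivative at $\theta_0$, then dividing the
numerator and denominator of the R.H.S. of (1) by $h^2$ and taking the limit as $h \to 0$, we get

$$\mathrm{Var}_{\theta_0}(T) \ge \frac{\{g'(\theta_0)\}^2}{\lim_{h \to 0} E_{\theta_0}\left[\dfrac{f_{\theta_0+h}(X) - f_{\theta_0}(X)}{h f_{\theta_0}(X)}\right]^2}. \qquad (5)$$

If, in addition, $f_\theta(x)$ is differentiable w.r.t. $\theta$ and the derivative is continuous in a
certain neighbourhood of $\theta_0$, then the denominator of the R.H.S. of (5) becomes

$$\lim_{h \to 0} E_{\theta_0}\left[\frac{f_{\theta_0+h}(X) - f_{\theta_0}(X)}{h f_{\theta_0}(X)}\right]^2 = \lim_{h \to 0} \int \left[\frac{f_{\theta_0+h}(x) - f_{\theta_0}(x)}{h f_{\theta_0}(x)}\right]^2 f_{\theta_0}(x)\,dx$$
$$= \int \lim_{h \to 0} \left[\frac{f_{\theta_0+h}(x) - f_{\theta_0}(x)}{h f_{\theta_0}(x)}\right]^2 f_{\theta_0}(x)\,dx = E_{\theta_0}\left[\frac{\partial}{\partial\theta} \log f_\theta(X)\Big|_{\theta = \theta_0}\right]^2,$$

and the Chapman-Robbins inequality becomes identical with the Cramer-Rao
inequality. The point to be noted here is that even the two extra assumptions needed in
this case impose no restriction on the nature of $T$. The Chapman-Robbins inequality may be regarded as an
improvement on the Cramer-Rao inequality because it does not require the stringent
regularity conditions that underlie the latter.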


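Example Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from a rectangular
distribution with p.d.f.

$$f_\theta(x) = \frac{1}{\theta} \quad \text{if } 0 < x < \theta, \qquad = 0 \text{ otherwise,}$$

where $0 < \theta < \infty$. In Module-16 we have seen that, as the support of the distribution
depends on $\theta$, the variance of the UMVUE of $\theta$ falls below the Cramer-Rao lower
bound. So in this situation one may wish to find the Chapman-Robbins lower bound
to the variance of an unbiased estimator of $\theta$. The joint p.d.f. of $X = (X_1, X_2, \ldots, X_n)$ is
given by

$$f_\theta(\mathbf{x}) = \frac{1}{\theta^n} \quad \text{if } 0 < x_i < \theta \text{ for all } i, \qquad = 0 \text{ otherwise.}$$

In this context the condition $f_{\theta_0}(\mathbf{x}) = 0 \Rightarrow f_{\theta_0+h}(\mathbf{x}) = 0$ implies that if an observation
$X_i > \theta_0$ then $X_i > \theta_0 + h$, which is possible only when $h < 0$. But the condition
$\theta_0 + h \in \Theta$ implies that $\theta_0 + h > 0$. So $-\theta_0 < h < 0$. If we take $g(\theta) = \theta$, then the
Chapman-Robbins lower bound to the variance of an unbiased estimator of $\theta$ is

$$\sup_h \frac{h^2}{E_{\theta_0}\left[\dfrac{f_{\theta_0+h}(X)}{f_{\theta_0}(X)}\right]^2 - 1}.$$

Now

$$\frac{f_{\theta_0+h}(\mathbf{x})}{f_{\theta_0}(\mathbf{x})} = \frac{\theta_0^n}{(\theta_0 + h)^n} \quad \text{if } 0 < x_i < \theta_0 + h \text{ for all } i, \qquad = 0 \text{ otherwise.}$$

So

$$E_{\theta_0}\left[\frac{f_{\theta_0+h}(X)}{f_{\theta_0}(X)}\right]^2 = \int \cdots \int \left[\frac{\theta_0^n}{(\theta_0 + h)^n}\right]^2 \frac{1}{\theta_0^n}\,d\mathbf{x},$$

where the $n$-fold integral is taken over the intervals $0 < x_i < \theta_0 + h$ for all $i$. Hence
we get

$$E_{\theta_0}\left[\frac{f_{\theta_0+h}(X)}{f_{\theta_0}(X)}\right]^2 = \left[\frac{\theta_0^n}{(\theta_0 + h)^n}\right]^2 (\theta_0 + h)^n \frac{1}{\theta_0^n} = \frac{\theta_0^n}{(\theta_0 + h)^n}.$$

Hence the Chapman-Robbins lower bound to the variance of an unbiased estimator
of $\theta$ is

$$\sup_{-\theta_0 < h < 0} \frac{h^2 (\theta_0 + h)^n}{\theta_0^n - (\theta_0 + h)^n}.$$

In particular when $n = 1$, it reduces to

$$\sup_{-\theta_0 < h < 0} \{-h(\theta_0 + h)\}.$$

To calculate the supremum note that

$$-h(\theta_0 + h) = \frac{\theta_0^2}{4} - \left(h + \frac{\theta_0}{2}\right)^2 \le \frac{\theta_0^2}{4}.$$

Hence $\sup_{-\theta_0 < h < 0} \{-h(\theta_0 + h)\} = \frac{\theta_0^2}{4}$ and the supremum is attained when
$h = -\frac{\theta_0}{2}$. Since $\theta_0$ is arbitrary, the Chapman-Robbins lower bound to the variance of
an unbiased estimator of $\theta$ is $\frac{\theta^2}{4}$ if $n = 1$.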
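For $n > 1$ the supremum has no equally simple closed form, but it is easy to approximate. The following Python sketch is an added illustration, not the author's method: it evaluates $h^2(\theta_0 + h)^n / \{\theta_0^n - (\theta_0 + h)^n\}$ on an equally spaced grid of interior points of $(-\theta_0, 0)$ and returns the largest value found. The function name and the grid size are assumptions.

```python
def chapman_robbins_uniform(theta0, n, grid=100000):
    """Grid approximation of sup over -theta0 < h < 0 of
    h^2 (theta0 + h)^n / (theta0^n - (theta0 + h)^n)."""
    best_value, best_h = 0.0, None
    for k in range(1, grid):
        h = -theta0 * k / grid                  # interior point of (-theta0, 0)
        numerator = h * h * (theta0 + h) ** n
        denominator = theta0 ** n - (theta0 + h) ** n
        value = numerator / denominator
        if value > best_value:
            best_value, best_h = value, h
    return best_value, best_h


if __name__ == "__main__":
    theta0 = 2.0
    for n in (1, 2, 5):
        bound, h_star = chapman_robbins_uniform(theta0, n)
        umvue_variance = theta0 ** 2 / (n * (n + 2))
        print(n, bound, h_star, umvue_variance)
```

For $n = 1$ the grid search recovers $\theta_0^2/4$ at $h = -\theta_0/2$; in each case the printed variance of the UMVUE can be compared with the approximate bound.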


Remark By the Lehmann-Scheffe theorem, $T = \frac{n+1}{n} X_{(n)}$ is the UMVUE of $\theta$, and in
Module 16 we have seen that $\mathrm{Var}_\theta(T) = \frac{\theta^2}{n(n+2)}$, which becomes $\frac{\theta^2}{3}$ when $n = 1$.
Hence $\mathrm{Var}_\theta(T)$ exceeds the Chapman-Robbins lower bound $\frac{\theta^2}{4}$ but it is less than the
Cramer-Rao lower bound $\theta^2$ that we have already seen in Module-16. This apparent
discrepancy occurs because in this case the support of the distribution
depends on $\theta$, and therefore the Cramer-Rao lower bound is inappropriate for this
problem. The Chapman-Robbins lower bound is an appropriate lower bound for
this problem.


Cramer-Rao lower bound in case of several parameters


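In this module we consider a generalization of the Cramer-Rao inequality when there is more
than one parameter. Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from $f_{\boldsymbol\theta}(\cdot)$,
where the vector valued parameter $\boldsymbol\theta$ consists of $k$ elements $(\theta_1, \theta_2, \ldots, \theta_k)$ belonging
to the parameter space $\Theta$. Consider a vector of linearly independent parametric
functions $(g_1(\boldsymbol\theta), g_2(\boldsymbol\theta), \ldots, g_k(\boldsymbol\theta))$ over the parameter space $\Theta$. Let $T = (T_1, T_2, \ldots, T_k)$
be a vector valued statistic such that $T_i$ is an unbiased estimator of $g_i(\boldsymbol\theta)$ for all
$i = 1, 2, \ldots, k$. We assume that the variance-covariance matrix $\Sigma_T$ of $T$ exists and is
positive definite. Let $\mathcal{F}_{\boldsymbol\theta} = \{f_{\boldsymbol\theta}(\mathbf{x}) : \boldsymbol\theta \in \Theta\}$ be the family of joint p.d.f.'s of $X$,
and suppose it satisfies the following regularity conditions:

1. $\Theta$ is an open subset of $R^k$.

2. The support $A = \{\mathbf{x} : f_{\boldsymbol\theta}(\mathbf{x}) > 0\}$ does not depend on $\boldsymbol\theta$.

3. $\frac{\partial}{\partial\theta_i} \log f_{\boldsymbol\theta}(\mathbf{x})$ exists for all $\mathbf{x}$ and for all $\theta_i$, $i = 1, 2, \ldots, k$.

4. For any statistic $h(X)$ with $E_{\boldsymbol\theta}(|h(X)|) < \infty$ for all $\boldsymbol\theta$, the operations
of integration and differentiation with respect to $\theta_i$, for all $i = 1, 2, \ldots, k$, can
be interchanged in $E_{\boldsymbol\theta}(h(X))$. That is, $\frac{\partial}{\partial\theta_i} E_{\boldsymbol\theta}(h(X)) = \frac{\partial}{\partial\theta_i} \int h(\mathbf{x}) f_{\boldsymbol\theta}(\mathbf{x})\,d\mathbf{x} = \int h(\mathbf{x}) \frac{\partial}{\partial\theta_i} f_{\boldsymbol\theta}(\mathbf{x})\,d\mathbf{x}$ for all $\theta_i$, $i = 1, 2, \ldots, k$.

5. The elements of the matrix $D = (d_{ij})$, where $d_{ij} = \frac{\partial}{\partial\theta_j} g_i(\boldsymbol\theta)$, $i, j = 1, 2, \ldots, k$, exist and $D$ is non-singular.

6. The elements of the information matrix $\Delta = (\delta_{ij})$, where

$$\delta_{ij} = E_{\boldsymbol\theta}\left[\frac{\partial}{\partial\theta_i} \log f_{\boldsymbol\theta}(X) \cdot \frac{\partial}{\partial\theta_j} \log f_{\boldsymbol\theta}(X)\right],$$

exist and are such that $\Delta$ is positive definite.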


Theorem If $\mathcal{F}_{\boldsymbol\theta}$ satisfies the above regularity conditions then the matrix $\Sigma_T - D \Delta^{-1} D'$
is n.n.d.


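Proof Let $S = (S_1, S_2, \ldots, S_k)'$ be the vector of score functions, where

$$S_i = \frac{\partial}{\partial\theta_i} \log f_{\boldsymbol\theta}(\mathbf{x}).$$

Then for each $i = 1, 2, \ldots, k$,

$$E_{\boldsymbol\theta}(S_i) = \int \frac{\partial}{\partial\theta_i} \log f_{\boldsymbol\theta}(\mathbf{x})\, f_{\boldsymbol\theta}(\mathbf{x})\,d\mathbf{x} = \int \frac{\partial}{\partial\theta_i} f_{\boldsymbol\theta}(\mathbf{x})\,d\mathbf{x} = \frac{\partial}{\partial\theta_i} \int f_{\boldsymbol\theta}(\mathbf{x})\,d\mathbf{x} = 0,$$

and for all $i, j = 1, 2, \ldots, k$,

$$\mathrm{Cov}_{\boldsymbol\theta}(S_i, S_j) = E_{\boldsymbol\theta}(S_i S_j) = \delta_{ij}.$$

Hence the variance-covariance matrix of $S$ is equal to $\Delta$. Also for all $i, j = 1, 2, \ldots, k$,

$$\mathrm{Cov}_{\boldsymbol\theta}(T_i, S_j) = E_{\boldsymbol\theta}(T_i S_j) = \int t_i(\mathbf{x}) \frac{\partial}{\partial\theta_j} f_{\boldsymbol\theta}(\mathbf{x})\,d\mathbf{x} = \frac{\partial}{\partial\theta_j} \int t_i(\mathbf{x}) f_{\boldsymbol\theta}(\mathbf{x})\,d\mathbf{x} = \frac{\partial}{\partial\theta_j} E_{\boldsymbol\theta}(T_i) = \frac{\partial}{\partial\theta_j} g_i(\boldsymbol\theta) = d_{ij}.$$

Now for any two fixed vectors $u \in R^k$ and $v \in R^k$ let us define two random variables
$Y$ and $Z$ such that

$$Y = u'S, \qquad Z = v'T.$$

Then we get

$$\mathrm{Cov}_{\boldsymbol\theta}(Y, Z) = E_{\boldsymbol\theta}(u' S T' v) = u' D' v,$$

$$\mathrm{Var}_{\boldsymbol\theta}(Y) = u' \Delta u, \quad \text{and} \quad \mathrm{Var}_{\boldsymbol\theta}(Z) = v' \Sigma_T v.$$

Then by the Cauchy-Schwarz inequality

$$(u' D' v)^2 \le (u' \Delta u)(v' \Sigma_T v). \qquad (1)$$

If we set $u = \Delta^{-1} D' v$, then as $(\Delta^{-1})' = \Delta^{-1}$, from (1) we get

$$(v' D \Delta^{-1} D' v)^2 \le (v' D \Delta^{-1} \Delta \Delta^{-1} D' v)(v' \Sigma_T v) = (v' D \Delta^{-1} D' v)(v' \Sigma_T v). \qquad (2)$$

However, since $\Delta$ is p.d. and $D$ is non-singular we have $v' D \Delta^{-1} D' v > 0$ for $v \ne 0$, and from
(2) we get

$$v' D \Delta^{-1} D' v \le v' \Sigma_T v$$

or

$$v'(\Sigma_T - D \Delta^{-1} D')v \ge 0 \quad \forall\, v \in R^k.$$

Thus $\Sigma_T - D \Delta^{-1} D'$ is n.n.d.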


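Remark 1 For the $i$-th component of $T$,

$$\mathrm{Var}_{\boldsymbol\theta}(T_i) \ge \sum_{j=1}^{k} \sum_{l=1}^{k} d_{ij}\, d_{il}\, \delta^{jl},$$

where $\delta^{jl}$ denotes the $(j, l)$-th element of $\Delta^{-1}$. In particular, when $k = 1$ we have

$$\mathrm{Var}_{\theta_1}(T_1) \ge (d_{11})^2 \delta^{11} = \left[\frac{\partial}{\partial\theta_1} g_1(\theta_1)\right]^2 \Big/ E_{\theta_1}\left[\frac{\partial}{\partial\theta_1} \log f_{\theta_1}(X)\right]^2.$$

The R.H.S. of the above inequality is the Cramer-Rao lower bound to the variance of an
unbiased estimator $T_1$ of $g_1(\theta_1)$.

Remark 2 If $g = (\theta_1, \theta_2, \ldots, \theta_k)$, we have $D = I_k$, where $I_k$ denotes the identity matrix
of order $k$. Then the theorem states that $\Sigma_T - \Delta^{-1}$ is n.n.d.

Remark 3 Since $\Sigma_T - D \Delta^{-1} D'$ is n.n.d. and $D \Delta^{-1} D'$ is also n.n.d., it follows that

$$|\Sigma_T| \ge |D \Delta^{-1} D'| = |D|^2 / |\Delta|.$$

In particular if $g = (\theta_1, \theta_2, \ldots, \theta_k)$ then

$$|\Sigma_T| \ge 1 / |\Delta|.$$

Remark 4 In the above theorem we have assumed that the dimension of the parameter
is the same as that of the vector of functions that we want to estimate. This
need not be the case. Suppose $\boldsymbol\theta = (\theta_1, \theta_2, \ldots, \theta_m)$ and $g = (g_1(\boldsymbol\theta), g_2(\boldsymbol\theta), \ldots, g_k(\boldsymbol\theta))$,
and let $D = (d_{ij})$ be the $k \times m$ matrix of $\frac{\partial}{\partial\theta_j} g_i(\boldsymbol\theta)$, $i = 1, 2, \ldots, k$, $j = 1, 2, \ldots, m$. Then
the theorem states that $\Sigma_T - D \Delta^{-1} D'$ is n.n.d. $\forall\, \boldsymbol\theta \in \Theta \subset R^m$. The proof is exactly
similar to that of the case $k = m$. The case $k < m$ is usually of interest; it corresponds, for example, to
estimating $(\theta_1, \theta_2, \ldots, \theta_k)$ only while $(\theta_{k+1}, \theta_{k+2}, \ldots, \theta_m)$ acts as a nuisance
parameter. In this case the theorem states that $\Sigma_T - \Delta^{11}$ is n.n.d., where $\Delta^{-1}$ is
partitioned as

$$\Delta^{-1} = \begin{pmatrix} \Delta^{11} & \Delta^{12} \\ \Delta^{21} & \Delta^{22} \end{pmatrix}.$$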
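Example Let $X_1, X_2, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$ variables. Writing $\boldsymbol\theta = (\theta_1, \theta_2)$, where
$\theta_1 = \mu$ and $\theta_2 = \sigma^2$, the p.d.f. of $X_i$ is given by

$$f_{\boldsymbol\theta}(x) = \frac{1}{\sqrt{2\pi\theta_2}} \exp\left[-\frac{1}{2\theta_2}(x - \theta_1)^2\right], \quad -\infty < x < \infty,\ -\infty < \theta_1 < \infty,\ \theta_2 > 0.$$

We know that $T_1 = \bar{X}$ and $T_2 = S^2$ are unbiased estimators of $\theta_1$ and $\theta_2$ respectively,
where $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$ and $S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$.

Since $\bar{X} \sim N(\theta_1, \frac{\theta_2}{n})$ and $\frac{(n-1)S^2}{\theta_2} \sim \chi^2_{n-1}$, we get

$$\mathrm{Var}_{\boldsymbol\theta}(T_1) = \frac{\theta_2}{n} \quad \text{and} \quad \mathrm{Var}_{\boldsymbol\theta}(T_2) = \frac{2\theta_2^2}{n-1}.$$

As $\bar{X}$ and $S^2$ are independent we get $\mathrm{Cov}_{\boldsymbol\theta}(T_1, T_2) = 0$. Hence the variance-covariance
matrix of $T$ is given by

$$\Sigma_T = \begin{pmatrix} \frac{\theta_2}{n} & 0 \\ 0 & \frac{2\theta_2^2}{n-1} \end{pmatrix}.$$

The elements of the score vector $S = (S_1, S_2)$ for a single observation are given by

$$S_1 = \frac{\partial \ln f}{\partial\theta_1} = \frac{x - \theta_1}{\theta_2}$$

and

$$S_2 = \frac{\partial \ln f}{\partial\theta_2} = -\frac{1}{2\theta_2} + \frac{(x - \theta_1)^2}{2\theta_2^2}.$$

Thus

$$E_{\boldsymbol\theta}\left(\frac{\partial \ln f}{\partial\theta_1}\right)^2 = \frac{1}{\theta_2^2} E_{\boldsymbol\theta}(X - \theta_1)^2 = \frac{1}{\theta_2},$$

$$E_{\boldsymbol\theta}\left(\frac{\partial \ln f}{\partial\theta_2}\right)^2 = \frac{1}{4\theta_2^2} + \frac{1}{4\theta_2^4} E_{\boldsymbol\theta}(X - \theta_1)^4 - \frac{1}{2\theta_2^3} E_{\boldsymbol\theta}(X - \theta_1)^2 = \frac{1}{4\theta_2^2} + \frac{3}{4\theta_2^2} - \frac{1}{2\theta_2^2} = \frac{1}{2\theta_2^2},$$

and

$$E_{\boldsymbol\theta}\left(\frac{\partial \ln f}{\partial\theta_1}\right)\left(\frac{\partial \ln f}{\partial\theta_2}\right) = -\frac{1}{2\theta_2^2} E_{\boldsymbol\theta}(X - \theta_1) + \frac{1}{2\theta_2^3} E_{\boldsymbol\theta}(X - \theta_1)^3 = 0.$$

As the $X_i$'s are i.i.d.,

$$\delta_{11} = \frac{n}{\theta_2}, \quad \delta_{22} = \frac{n}{2\theta_2^2} \quad \text{and} \quad \delta_{12} = 0.$$

Hence the information matrix is given by

$$\Delta = \begin{pmatrix} \frac{n}{\theta_2} & 0 \\ 0 & \frac{n}{2\theta_2^2} \end{pmatrix}.$$

As in this case $D = I_2$, we get

$$\Sigma_T - \Delta^{-1} = \begin{pmatrix} \frac{\theta_2}{n} & 0 \\ 0 & \frac{2\theta_2^2}{n-1} \end{pmatrix} - \begin{pmatrix} \frac{\theta_2}{n} & 0 \\ 0 & \frac{2\theta_2^2}{n} \end{pmatrix} = \begin{pmatrix} 0 & 0 \\ 0 & \frac{2\theta_2^2}{n(n-1)} \end{pmatrix}.$$

Note: Since $T_1$ and $T_2$ are respectively the UMVUEs of $\theta_1$ and $\theta_2$, from the above
example we see that the variance of $T_1$ attains the Cramer-Rao lower bound but the
variance of $T_2$ does not.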
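The matrix difference in this example is diagonal, so it can be checked entry by entry. The following Python sketch is an added illustration with a hypothetical function name: using exact rational arithmetic, it forms the diagonal of $\Sigma_T$ and of $\Delta^{-1}$ for given $n$ and $\theta_2 = \sigma^2$ and returns their difference.

```python
from fractions import Fraction


def normal_two_parameter_gap(n, sigma2):
    """Diagonal of Sigma_T - Delta^{-1} for T = (Xbar, S^2) under N(mu, sigma2)."""
    sigma2 = Fraction(sigma2)
    var_t = (sigma2 / n, 2 * sigma2 ** 2 / (n - 1))    # Var(Xbar), Var(S^2)
    info = (n / sigma2, n / (2 * sigma2 ** 2))         # delta_11, delta_22 (delta_12 = 0)
    bound = (1 / info[0], 1 / info[1])                 # diagonal of Delta^{-1}
    return tuple(v - b for v, b in zip(var_t, bound))


if __name__ == "__main__":
    print(normal_two_parameter_gap(5, 3))   # second entry equals 2 * 3^2 / (5 * 4)
```

The first entry is zero (the bound is attained by $\bar{X}$) and the second equals $2\theta_2^2/\{n(n-1)\} > 0$, so the difference is n.n.d. but not zero.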


Interval Estimation


Introduction 

So far we have discussed the point estimation of a parameter, or more precisely, point
estimation of several real valued parametric functions. Such point estimates are quite
useful, yet they leave something to be desired. In the case of continuous distributions
the probability that the point estimator actually equals the value of the parameter
being estimated is zero. Hence, it seems desirable that a point estimate should be
accompanied by some measure of the possible error of the estimate. For instance, a point
estimate may be accompanied by some interval about the point estimate together
with some measure of assurance that the true value of the parameter lies within the
interval. Instead of making the inference of estimating the true value of the parameter
to be a point, we might make the inference that the true value of the parameter is
contained in some interval. This is called the problem of interval estimation.


Confidence Interval 
Let $X_1, X_2, \ldots, X_n$ be a random sample from a population with p.m.f. or p.d.f. $f_\theta(x)$,
where $\theta \in \Theta\ (\subset R)$ is unknown. Let $T_1$ and $T_2$ be two statistics satisfying $T_1 \le T_2$
for which $P_\theta(T_1 \le \theta \le T_2) = 1 - \alpha$, where $\alpha \in (0, 1)$ does not depend on $\theta$. Thus the
interval $[T_1, T_2]$ will include the parameter $\theta$ with probability $(1 - \alpha)$. This means
that if a large number of random samples of the same size are taken, and the intervals
mentioned above are determined on the basis of these samples, then $100(1 - \alpha)\%$ of
these intervals will include the true value of the parameter $\theta$. The random interval
$[T_1, T_2]$ is called a $100(1 - \alpha)\%$ confidence interval for $\theta$. The quantity $(1 - \alpha)$ is
called the confidence level, and $T_1$ and $T_2$ are called the lower and upper confidence
limits, respectively, for $\theta$.
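Example 1. Let $X_1, \ldots, X_n \sim N(\mu, 1)$. We know that $\bar{X} \sim N(\mu, \frac{1}{n})$. Therefore,

$$P_\mu\left[-\tau_{\alpha/2} \le \sqrt{n}(\bar{X} - \mu) \le \tau_{\alpha/2}\right] = 1 - \alpha \quad \forall\, \mu \in R,$$

where $\tau_{\alpha/2}$ denotes the upper $100(\alpha/2)\%$ point of the $N(0, 1)$ distribution.
Hence by rearranging the inequalities,

$$P_\mu\left[\bar{X} - \frac{\tau_{\alpha/2}}{\sqrt{n}} \le \mu \le \bar{X} + \frac{\tau_{\alpha/2}}{\sqrt{n}}\right] = 1 - \alpha \quad \forall\, \mu \in R.$$

Hence $\left[\bar{x} - \frac{\tau_{\alpha/2}}{\sqrt{n}},\ \bar{x} + \frac{\tau_{\alpha/2}}{\sqrt{n}}\right]$ is a $100(1 - \alpha)\%$ confidence interval for $\mu$.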
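The interval of Example 1 is straightforward to compute. The following is a minimal Python sketch, assuming the data really come from a normal population with known variance 1; the function name `normal_mean_ci` is illustrative, and the quantile $\tau_{\alpha/2}$ is obtained from the standard library's `statistics.NormalDist`.

```python
import math
from statistics import NormalDist, fmean


def normal_mean_ci(sample, alpha=0.05):
    """100(1 - alpha)% interval xbar -/+ tau_{alpha/2} / sqrt(n), assuming variance 1."""
    n = len(sample)
    xbar = fmean(sample)
    tau = NormalDist().inv_cdf(1 - alpha / 2)   # upper 100(alpha/2)% point of N(0, 1)
    half_width = tau / math.sqrt(n)
    return xbar - half_width, xbar + half_width


if __name__ == "__main__":
    lower, upper = normal_mean_ci([0.3, 1.2, -0.4, 0.9, 1.6, 0.1])
    print(lower, upper)
```

The length of the interval, $2\tau_{\alpha/2}/\sqrt{n}$, does not depend on the observations; only its centre $\bar{x}$ does.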


Determination of confidence interval using a Pivotal quantity 


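Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from a population with p.m.f. or p.d.f.
$f_\theta(x)$. A real valued function $Q(X, \theta)$ of $X$ and $\theta$ is called a pivot if its
distribution is completely known for all $\theta \in \Theta$, i.e. $Q$ has a distribution that does not
depend on $\theta$.

Since the distribution of $Q$ is independent of $\theta$ we can find two numbers $q_1$ and
$q_2$ $(q_1 < q_2)$ such that

$$P_\theta[q_1 \le Q(X, \theta) \le q_2] \ge 1 - \alpha \quad \forall\, \theta \in \Theta.$$

If the inequalities $q_1 \le Q(\mathbf{x}, \theta) \le q_2$ can be rewritten as $t_1(\mathbf{x}) \le \theta \le t_2(\mathbf{x})$, then
$[t_1(\mathbf{x}), t_2(\mathbf{x})]$ is a confidence interval for $\theta$ at confidence level $1 - \alpha$.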

Shortest confidence interval based on a pivot: 

For any fixed $(1 - \alpha)$ there are many possible pairs of numbers $q_1$ and $q_2$ that can be
selected so that $P(q_1 \le Q \le q_2) = 1 - \alpha$. Different pairs of $q_1$ and $q_2$ will produce different
$t_1$ and $t_2$. We would like to select the pair of $q_1$ and $q_2$ that makes $t_1$ and $t_2$
close together in some sense. For instance, if $t_2(\mathbf{x}) - t_1(\mathbf{x})$, which is the length of the
confidence interval, is not random, then we might select the pair of $q_1$ and
$q_2$ that makes the length of the interval smallest; or if the length of the confidence
interval is random, then we might select the pair $q_1$ and $q_2$ that makes the average
length of the interval smallest.

Thus a shortest length confidence interval (SLCI) is the confidence interval for which
$t_2(\mathbf{x}) - t_1(\mathbf{x})$ is minimized, and a shortest expected length confidence interval (SELCI)
is the confidence interval for which $E_\theta(T_2(X) - T_1(X))$ is minimized $\forall\, \theta \in \Theta$.


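Example 2. Let $X = (X_1, \ldots, X_n)$, $X_i \sim N(\mu, 1)$. We want to find the shortest length confidence
interval for $\mu$ based on the pivot $Q(X, \mu) = \sqrt{n}(\bar{X} - \mu)$. We know that $Q(X, \mu) \sim N(0, 1)$ for any $\mu$.

Therefore $t_1(\mathbf{x}) = \bar{x} - \frac{q_2}{\sqrt{n}}$ and $t_2(\mathbf{x}) = \bar{x} - \frac{q_1}{\sqrt{n}}$, hence $t_2(\mathbf{x}) - t_1(\mathbf{x}) = \frac{q_2 - q_1}{\sqrt{n}}$
is not a random variable. Our objective is to minimize $q_2 - q_1$ subject to

$$P[q_1 \le Z \le q_2] = 1 - \alpha, \quad Z \sim N(0, 1).$$

Let us define

$$\psi(q_1, q_2) = (q_2 - q_1) + \lambda\left[F_Z(q_2) - F_Z(q_1) - (1 - \alpha)\right].$$

Differentiating $\psi(q_1, q_2)$ w.r.t. $q_1$ and $q_2$ and equating the derivatives to 0, we get the following set of equations,
whose solution gives the shortest length confidence interval:

$$\frac{\partial}{\partial q_1} \psi(q_1, q_2) = -1 - \lambda f_Z(q_1) = 0$$

and

$$\frac{\partial}{\partial q_2} \psi(q_1, q_2) = 1 + \lambda f_Z(q_2) = 0$$

$$\Rightarrow\; f_Z(q_1) = f_Z(q_2).$$

Therefore $q_1 = -q_2$ and hence $q_2 = \tau_{\alpha/2}$, where $\tau_{\alpha/2}$ denotes the upper $100(\alpha/2)\%$ point of the
$N(0, 1)$ distribution.

$$\text{SLCI}: \left[\bar{x} - \frac{\tau_{\alpha/2}}{\sqrt{n}},\ \bar{x} + \frac{\tau_{\alpha/2}}{\sqrt{n}}\right].$$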
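The conclusion $q_1 = -q_2$ can also be seen numerically without the Lagrangian. The sketch below is an added illustration with a hypothetical function name: it scans the probability mass $a$ placed to the left of $q_1$, sets $q_2$ so that the coverage is exactly $1 - \alpha$, and keeps the pair with the smallest $q_2 - q_1$.

```python
from statistics import NormalDist


def shortest_normal_pivot_interval(alpha=0.05, steps=2000):
    """Scan lower-tail masses a in (0, alpha); q1 and q2 satisfy P[q1 <= Z <= q2] = 1 - alpha."""
    z = NormalDist()
    best = None
    for k in range(1, steps):
        a = alpha * k / steps                  # probability mass to the left of q1
        q1 = z.inv_cdf(a)
        q2 = z.inv_cdf(a + 1 - alpha)          # keeps the coverage equal to 1 - alpha
        if best is None or q2 - q1 < best[0]:
            best = (q2 - q1, q1, q2)
    return best


if __name__ == "__main__":
    length, q1, q2 = shortest_normal_pivot_interval()
    print(length, q1, q2)
```

On this grid the minimum is reached at the equal-tails split $a = \alpha/2$, which corresponds to $q_1 = -\tau_{\alpha/2}$ and $q_2 = \tau_{\alpha/2}$.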


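Example 3. Let $X = (X_1, \ldots, X_n)$, $X_i \sim N(\mu, \sigma^2)$. We want to find the shortest length confidence
interval for $\mu$ based on the pivot $Q(X, \mu) = \frac{\sqrt{n}(\bar{X} - \mu)}{S}$, where $S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$. We know that $Q(X, \mu) \sim t_{n-1}$
for any $\mu, \sigma$.

$$P_{\mu,\sigma}\left[q_1 \le Q(X, \mu) \le q_2\right] = 1 - \alpha$$

$$\text{or,} \quad P_{\mu,\sigma}\left[\bar{X} - q_2 \frac{S}{\sqrt{n}} \le \mu \le \bar{X} - q_1 \frac{S}{\sqrt{n}}\right] = 1 - \alpha.$$

Therefore $t_1(\mathbf{x}) = \bar{x} - q_2 \frac{s}{\sqrt{n}}$ and $t_2(\mathbf{x}) = \bar{x} - q_1 \frac{s}{\sqrt{n}}$, hence $t_2(\mathbf{x}) - t_1(\mathbf{x}) = (q_2 - q_1) \frac{s}{\sqrt{n}}$.
Our objective is to minimize $q_2 - q_1$ subject to

$$P[q_1 \le t_{n-1} \le q_2] = 1 - \alpha.$$

Let us define

$$\psi(q_1, q_2) = (q_2 - q_1) + \lambda\left[F_{t_{n-1}}(q_2) - F_{t_{n-1}}(q_1) - (1 - \alpha)\right].$$

Differentiating $\psi(q_1, q_2)$ w.r.t. $q_1$ and $q_2$ and equating the derivatives to 0, we get the following set of equations,
whose solution gives the shortest length confidence interval:

$$\frac{\partial}{\partial q_1} \psi(q_1, q_2) = -1 - \lambda f_{t_{n-1}}(q_1) = 0$$

and

$$\frac{\partial}{\partial q_2} \psi(q_1, q_2) = 1 + \lambda f_{t_{n-1}}(q_2) = 0$$

$$\Rightarrow\; f_{t_{n-1}}(q_1) = f_{t_{n-1}}(q_2).$$

Therefore $q_1 = -q_2$ and hence $q_2 = t_{\alpha/2,\,n-1}$, and

$$\text{SLCI}: \left[\bar{X} - t_{\alpha/2,\,n-1} \frac{S}{\sqrt{n}},\ \bar{X} + t_{\alpha/2,\,n-1} \frac{S}{\sqrt{n}}\right].$$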


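Example 4. Let $X = (X_1, \ldots, X_n)$, $X_i \sim N(\mu, \sigma^2)$. We want to find the shortest length confidence
interval for $\sigma^2$ based on the pivot $Q(X, \sigma^2) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{\sigma^2}$. We know that $Q(X, \sigma^2) \sim \chi^2_{n-1}$ for any $\mu, \sigma^2$.

$$P_{\mu,\sigma}\left[q_1 \le Q(X, \sigma^2) \le q_2\right] = 1 - \alpha$$

$$\text{or,} \quad P_{\mu,\sigma}\left[q_1 \le \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{\sigma^2} \le q_2\right] = 1 - \alpha.$$

Hence $t_2(\mathbf{x}) - t_1(\mathbf{x}) = \sum_{i=1}^{n} (x_i - \bar{x})^2 \left(\frac{1}{q_1} - \frac{1}{q_2}\right)$. We minimize $\frac{1}{q_1} - \frac{1}{q_2}$ subject to

$$P[q_1 \le \chi^2_{n-1} \le q_2] = 1 - \alpha.$$

Let us define

$$\psi(q_1, q_2) = \left(\frac{1}{q_1} - \frac{1}{q_2}\right) + \lambda\left[F_{\chi^2_{n-1}}(q_2) - F_{\chi^2_{n-1}}(q_1) - (1 - \alpha)\right].$$

Differentiating $\psi(q_1, q_2)$ w.r.t. $q_1$ and $q_2$ and equating the derivatives to 0, we get the following set of equations,
whose solution gives the shortest length confidence interval:

$$\frac{\partial}{\partial q_1} \psi(q_1, q_2) = -\frac{1}{q_1^2} - \lambda f_{\chi^2_{n-1}}(q_1) = 0$$

and

$$\frac{\partial}{\partial q_2} \psi(q_1, q_2) = \frac{1}{q_2^2} + \lambda f_{\chi^2_{n-1}}(q_2) = 0.$$

Solving the above two equations simultaneously, we get

$$q_1^2 f_{\chi^2_{n-1}}(q_1) = q_2^2 f_{\chi^2_{n-1}}(q_2)$$

$$\Rightarrow\; \exp(-q_1/2)\, q_1^{(n+1)/2} = \exp(-q_2/2)\, q_2^{(n+1)/2}.$$

This equation, together with the coverage condition $P[q_1 \le \chi^2_{n-1} \le q_2] = 1 - \alpha$, determines $q_1$ and $q_2$; in general the pair has to be found numerically.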


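Example 5. Let $X = (X_1, \ldots, X_n)$, $X_i \sim R(0, \theta)$. We want to find the shortest length confidence
interval for $\theta$ based on the pivot $Q(X_{(n)}, \theta) = \frac{X_{(n)}}{\theta}$. Note that the probability density
of $W = X_{(n)}$ is given by

$$p_\theta(w) = \frac{n w^{n-1}}{\theta^n}, \quad 0 < w < \theta.$$

If we consider the transformation $Y = \frac{W}{\theta}$,

$$p(y) = n y^{n-1}, \quad 0 < y < 1.$$

$$\int_{q_1}^{q_2} n y^{n-1}\,dy = P[q_1 \le Y \le q_2] = 1 - \alpha$$
$$\Rightarrow\; P_\theta\left[q_1 \le \frac{X_{(n)}}{\theta} \le q_2\right] = 1 - \alpha$$
$$\Rightarrow\; P_\theta\left[\frac{1}{q_2} \le \frac{\theta}{X_{(n)}} \le \frac{1}{q_1}\right] = 1 - \alpha$$
$$\Rightarrow\; P_\theta\left[\frac{X_{(n)}}{q_2} \le \theta \le \frac{X_{(n)}}{q_1}\right] = 1 - \alpha.$$

Hence $t_2(\mathbf{x}) - t_1(\mathbf{x}) = x_{(n)} \left(\frac{1}{q_1} - \frac{1}{q_2}\right)$. We minimize $L = \frac{1}{q_1} - \frac{1}{q_2}$ subject to

$$\int_{q_1}^{q_2} n y^{n-1}\,dy = q_2^n - q_1^n = 1 - \alpha. \qquad (1)$$

Now $(1 - \alpha)^{1/n} \le q_2 \le 1$ and

$$\frac{dL}{dq_2} = -\frac{1}{q_1^2} \frac{dq_1}{dq_2} + \frac{1}{q_2^2}. \qquad (2)$$

Again from (1) we get

$$n q_2^{n-1} - n q_1^{n-1} \frac{dq_1}{dq_2} = 0 \;\Rightarrow\; \frac{dq_1}{dq_2} = \frac{q_2^{n-1}}{q_1^{n-1}}.$$

Hence from (2) we get

$$\frac{dL}{dq_2} = -\frac{q_2^{n-1}}{q_1^{n+1}} + \frac{1}{q_2^2} = \frac{q_1^{n+1} - q_2^{n+1}}{q_1^{n+1} q_2^2} < 0.$$

Hence $L$ is a decreasing function of $q_2$. So $L$ is minimum if $q_2$ is maximum, and the
maximum value occurs at $q_2 = 1$, and hence $q_1 = \alpha^{1/n}$. Hence the $100(1 - \alpha)\%$ shortest
length confidence interval for $\theta$ is $\left(x_{(n)},\ \frac{x_{(n)}}{\alpha^{1/n}}\right)$.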
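The monotonicity argument of Example 5 can be illustrated numerically. In the Python sketch below (an added illustration; the function names are hypothetical), `uniform_pivot_length_factor` evaluates $L = 1/q_1 - 1/q_2$ with $q_1$ determined from the coverage constraint $q_2^n - q_1^n = 1 - \alpha$, and `shortest_uniform_interval` returns the interval obtained at $q_2 = 1$.

```python
def uniform_pivot_length_factor(q2, n, alpha):
    """Return L = 1/q1 - 1/q2, where q1 is fixed by q2^n - q1^n = 1 - alpha."""
    q1 = (q2 ** n - (1 - alpha)) ** (1.0 / n)
    return 1.0 / q1 - 1.0 / q2


def shortest_uniform_interval(x_max, n, alpha=0.05):
    """Interval (x_(n), x_(n) / alpha^(1/n)) obtained at q2 = 1, q1 = alpha^(1/n)."""
    return x_max, x_max / alpha ** (1.0 / n)


if __name__ == "__main__":
    n, alpha = 5, 0.05
    for q2 in (0.992, 0.995, 0.998, 1.0):
        print(q2, uniform_pivot_length_factor(q2, n, alpha))
    print(shortest_uniform_interval(2.0, n, alpha))
```

The printed factors decrease as $q_2$ increases towards 1, in agreement with $dL/dq_2 < 0$. Values of $q_2$ must stay above $(1 - \alpha)^{1/n}$, otherwise $q_1$ is not defined.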


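Example 6. Let $(X, Y) = \{(X_1, Y_1), \ldots, (X_n, Y_n)\} \sim N_2(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$. We want to find a
confidence interval for a ratio of means. Let $\theta = \mu_2/\mu_1$ be the parameter of interest,
with $\mu_1 \ne 0$. Define $Z_i(\theta) = Y_i - \theta X_i$ $\left(\sim N(0,\ \sigma_2^2 - 2\theta\rho\sigma_1\sigma_2 + \theta^2\sigma_1^2)\right)$. Let the sample
variance of the $Z_i(\theta)$ be

$$S^2(\theta) = \frac{1}{n-1} \sum_{i=1}^{n} \left(Z_i(\theta) - \bar{Z}(\theta)\right)^2 = S_2^2 - 2\theta S_{12} + \theta^2 S_1^2,$$

where $S_1^2$, $S_2^2$ and $S_{12}$ are the sample variances and covariance of the $X_i$ and $Y_i$. Now from
Example 3 we know that $\sqrt{n}\bar{Z}(\theta)/S(\theta) \sim t_{n-1}$, and therefore
$Q((X, Y), \theta) = \sqrt{n}\bar{Z}(\theta)/S(\theta)$ is a pivotal quantity. Hence,

$$P_\theta\left[q_1 \le Q((X, Y), \theta) \le q_2\right] = 1 - \alpha.$$

By symmetry we take $-q_1 = q_2 = t_{\alpha/2,\,n-1}$, therefore

$$P_\theta\left[\left|\frac{\sqrt{n}\bar{Z}(\theta)}{S(\theta)}\right| \le t_{\alpha/2,\,n-1}\right] = 1 - \alpha$$

$$\text{or,} \quad P_\theta\left[n\bar{Z}^2(\theta) \le t^2_{\alpha/2,\,n-1}\left(S_2^2 - 2\theta S_{12} + \theta^2 S_1^2\right)\right] = 1 - \alpha.$$

Note that $t^2_{\alpha/2,\,n-1}\left(S_2^2 - 2\theta S_{12} + \theta^2 S_1^2\right) = n\bar{Z}^2(\theta)$ defines a quadratic equation in $\theta$. Depending on
its roots, the confidence set can be a finite interval, the complement of a finite interval,
or the whole real line.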

Remark 

Like point estimation, the problem of interval estimation is twofold. First, there is the
problem of finding interval estimators, and second, there is the problem of determining
good, or optimum, interval estimators. In this lecture we have discussed a method of
finding a good confidence interval by minimizing the expected length of the interval.
The other criteria for obtaining an optimum confidence interval will be discussed at the end of
the lecture on testing of hypothesis.


Statistical Inference I 
Introduction to Testing of Hypothesis 
Module- 21 
Shirsendu Mukherjee 
Department of Statistics, Asutosh College, Kolkata, India. 


shirsendu_st@yahoo.co.in


In Modules 1 - 19 we have discussed the theory of point estimation and in Module-20
we have introduced the basic idea of interval estimation. Now we are going to introduce
the notions of testing of hypothesis. In point estimation some characteristic or
feature of the population in which we are interested may be completely unknown to
us and we may like to make a guess about the characteristic entirely on the basis of a
random sample drawn from the population, whereas in interval estimation we try to
put forward an interval with the hope that the interval would contain the unknown
parameter with some preassigned probability. In testing of hypothesis, some information
regarding the characteristic or feature of the population may be available to
us and we may like to know whether the information is tenable (or can be accepted)
in the light of the random sample drawn from the population.


Historical Perspective The first published test of a statistical hypothesis was by
John Arbuthnott (Kendall and Plackett, 1977) in 1710, who wondered about the fact
that in human births, the fraction of boys born year after year appears to be slightly
larger than the fraction of girls. He considered data on the annual number of christenings
in London classified by sex for the 82 years 1629-1710. Equating christenings
with live births, his intention was to test the hypothesis H: births represent independent
trials with a constant probability 0.5 of a male child being born. If H holds, he
would say that chance governs the birth of a child. He used the original data only
to check whether each year was a male year (i.e. having a majority of male births)
or not. As it turned out, all the 82 years considered by Arbuthnott were male years.
Using the theory of Bernoulli trials he observed that if H holds then, irrespective
of the number of births in different years, the probability of all 82 years considered
being male years does not exceed $(1/2)^{82}$. As this is exceedingly small, Arbuthnott
rejected H and concluded that 'it is Art, not chance that governs' the birth of a child.
He argued that this was a proof of divine providence: since some of the boys will be
soldiers, they have a higher risk of an early death, so that a higher ratio of male births
is needed to obtain an equal ratio of males among young adults. We see here the basic
elements of a test: the proposition that the male birth ratio is 0.5, related to data
by regarding these as outcomes of stochastic variables, the calculation that the data
would be unlikely if the proposition were true, and a further conclusion interpreting
the falsity of the proposition. The above idea was first used by Karl Pearson in 1900
for proposing the well-known Chi-squared test of goodness of fit, and later by Fisher
in proposing various tests of significance. A unified approach for testing statistical
hypotheses was developed by Jerzy Neyman and Egon Pearson in 1928.


Notations and preliminaries 

Let $\theta$ be the labelling parameter of the distribution of a random variable $X$ and
$\Theta$ be the parametric space. As in the case of the parametric estimation problem, $\theta$
can be either real valued or vector valued. Instead of $f_\theta(x)$, here we use $p(x,\theta)$
as the p.m.f. or p.d.f. of the random variable $X$, and the family of distributions
is denoted as $\mathcal{P} = \{p(x,\theta);\ \theta \in \Theta\}$. The objective here is to know whether
$p(x,\theta) \in \mathcal{P}_H \subset \mathcal{P}$ on the basis of $X = x$. So our conjecture is $H$, which is our null
hypothesis. Whenever we fix $H$ we have a corresponding hypothesis called the alternative
hypothesis, often denoted by $H_A$. Specifically, we write $H$ to mean the null hypothesis
$H : p(x,\theta) \in \mathcal{P}_H = \{p(x,\theta);\ \theta \in \Theta_H\}$, which is equivalent to $H : \theta \in \Theta_H$. On the
other hand, $H_A$ is represented as $H_A : p(x,\theta) \in \mathcal{P}_{H_A} = \{p(x,\theta);\ \theta \in \Theta_{H_A}\}$, which is
equivalent to $H_A : \theta \in \Theta_{H_A}$. Further note that $\Theta_H, \Theta_{H_A} \subset \Theta$ and $\Theta_H \cap \Theta_{H_A} = \emptyset$.

Depending on the nature of the hypothesis we can classify it as a simple or a composite
hypothesis.


Simple and Composite hypothesis 

A statistical hypothesis $H$ is called simple if it completely specifies the distribution
of the random variable $X$; otherwise it is called composite. If $\Theta_H$ is a single point set
then $H$ is simple; otherwise it is composite.

Example Let $X \sim N(\theta,1)$. If we want to test $H : \theta = 0$ against $H_A : \theta > 0$, then $H$
is simple but $H_A$ is composite.

Example Let $X \sim N(\mu,\sigma^2)$ where both $\mu$ and $\sigma$ are unknown. If we want to test
$H : \mu = 0$ against $H_A : \mu > 0$, then both $H$ and $H_A$ are composite because $\sigma$ is
unknown.


Test of a hypothesis 
A test of a statistical hypothesis $H$ is a rule or procedure by which one can accept or
reject $H$. Two types of test are used in practice: one is called a non-randomized test
and the other is called a randomized test.

Non-randomized test (Deterministic test)

Let $\mathcal{X}$ be the sample space. A non-randomized test is a rule by which one chooses
a region $\omega\ (\subset \mathcal{X})$ before the start of the experiment such that whenever $X = x$ is
observed, $H$ is rejected if $x \in \omega$; otherwise it is accepted. The region $\omega$ is called
the critical region and $\mathcal{X} - \omega$ is called the acceptance region of the test. A non-randomized
test is also called a deterministic test because here the critical region is
fixed completely before the testing and no probabilistic statement is attached to
the region.


Test statistic 


In practice, associated with any critical region $\omega$, it is possible to find a statistic
$T = T(X)$ such that,

$[x \in \omega] \Leftrightarrow \{x : T(x) > c_u\}$ : Upper-tail test (one-sided alternative)

or,

$[x \in \omega] \Leftrightarrow \{x : T(x) < c_l\}$ : Lower-tail test (one-sided alternative)

or,

$[x \in \omega] \Leftrightarrow \{x : T(x) < c_l \text{ or } T(x) > c_u\}$ : Two-tailed test (two-sided alternative).

Such a statistic $T(X)$ is called a test statistic. It is used in specifying the critical
region $\omega$.

Example Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from a $N(\theta,1)$ population.
For testing $H : \theta = 0$ against $H_A : \theta > 0$ suppose one rejects $H$ if $\sum_{i=1}^{n} X_i > c$ where
$c$ is a constant. Then $\sum_{i=1}^{n} X_i$ is the test statistic for testing $H$ and the critical region
of the test is $\omega = \{(x_1, x_2, \ldots, x_n) : \sum_{i=1}^{n} x_i > c\}$.


Two types of errors 

In a non-randomized testing problem there will be either a correct or an incorrect decision.
A correct decision arises when we accept $H$ if it is true or reject
$H$ when it is false. On the other hand, an incorrect decision arises if we reject $H$
when it is true or accept $H$ when it is false. The two kinds of incorrect decision are
called type I error and type II error. Type I error
is committed if we reject $H$ when it is true, while type II error is committed if we accept $H$
when it is false. For example, in statistical quality control a Type I error occurs when
a lot of good quality is rejected and a Type II error occurs when a lot of bad
quality is accepted. The probabilities of the two types of errors are as follows:

\[ P(\text{Type I error}) = P(X \in \omega \mid \theta \in \Theta_H) \quad \text{and} \quad P(\text{Type II error}) = P(X \in \mathcal{X} - \omega \mid \theta \in \Theta_{H_A}). \]

To study the behavior of the two error probabilities let us define a function $\beta_\omega(\theta) =
P\{X \in \omega \mid \theta\}$. The function $\beta_\omega(\theta)$ defined on $\Theta$ is called the power function of the test
with critical region $\omega$. Then

\[ P(\text{Type I error}) = \beta_\omega(\theta),\ \theta \in \Theta_H \quad \text{and} \quad P(\text{Type II error}) = 1 - \beta_\omega(\theta),\ \theta \in \Theta_{H_A}. \]

An ideal test procedure would be one for which $\beta_\omega(\theta),\ \theta \in \Theta_H$ and $1 - \beta_\omega(\theta),\ \theta \in \Theta_{H_A}$
are minimized simultaneously by selecting $\omega$ in a proper way. However, even in the
simplest problem where $\Theta_H = \{\theta_0\}$ and $\Theta_{H_A} = \{\theta_1\}$, we cannot select $\omega$ such that
both $\beta_\omega(\theta_0)$ and $1 - \beta_\omega(\theta_1)$ are minimized simultaneously. This follows from the fact
that $\beta_\omega(\theta_0)$ can be reduced only by removing sample points from $\omega$, i.e. shrinking $\omega$,
but this results in reducing $\beta_\omega(\theta_1)$, which increases $1 - \beta_\omega(\theta_1)$. On the other hand,
$1 - \beta_\omega(\theta_1)$ can be reduced only by increasing $\beta_\omega(\theta_1)$, i.e. by enlarging the set $\omega$, which
increases $\beta_\omega(\theta_0)$.

Example Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from a $N(\theta,1)$ population.
For testing $H : \theta = \theta_0$ against $H_A : \theta = \theta_1\ (> \theta_0)$ suppose one rejects $H$ if $\bar{X} =
\frac{1}{n}\sum_{i=1}^{n} X_i > c$ where $c$ is a constant. Then the error probabilities are given by

\[ P(\text{Type I error}) = P_{\theta_0}(\bar{X} > c) = 1 - \Phi\left(\sqrt{n}(c - \theta_0)\right) \]

\[ P(\text{Type II error}) = P_{\theta_1}(\bar{X} \le c) = \Phi\left(\sqrt{n}(c - \theta_1)\right) \]

If we want to reduce $P(\text{Type I error})$ then we have to increase $c$ and hence $P(\text{Type II error})$
will increase. On the other hand, if we want to reduce $P(\text{Type II error})$ then we have
to decrease $c$. As a result $P(\text{Type I error})$ will increase.
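The trade-off between the two error probabilities can be tabulated directly from the two formulas of this example. The sketch below is illustrative only: the function name `error_probabilities` and the values n = 25, theta0 = 0, theta1 = 0.5 are assumptions chosen for the illustration, and the standard normal c.d.f. is taken from Python's `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist

def error_probabilities(c, n, theta0, theta1):
    """Type I and type II error probabilities of the test 'reject H if sample mean > c'
    for a N(theta, 1) sample of size n."""
    phi = NormalDist().cdf                       # standard normal c.d.f.
    type1 = 1.0 - phi(sqrt(n) * (c - theta0))    # P_theta0(mean > c)
    type2 = phi(sqrt(n) * (c - theta1))          # P_theta1(mean <= c)
    return type1, type2

n, theta0, theta1 = 25, 0.0, 0.5                 # illustrative values
cutoffs = [0.1, 0.2, 0.3, 0.4]
table = [error_probabilities(c, n, theta0, theta1) for c in cutoffs]

for c, (type1, type2) in zip(cutoffs, table):
    print(f"c = {c:.1f}  P(Type I) = {type1:.4f}  P(Type II) = {type2:.4f}")
```

As c increases, the first column of probabilities falls while the second rises, which is the behaviour described in the example.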

Power of a test 

The quantity $\beta_\omega(\theta),\ \theta \in \Theta_{H_A}$ is called the power of the test $\omega$ at different $\theta$. Hence
the power of a test represents the probability of a correct decision and it is equal to
$1 - P(\text{Type II error})$.

Let us consider some examples. 


Example Let X ~ Bin(5,p). For testing H : p= i against p — 3 the following test 


is used: 


w= {x:x > 3} 


Then 


P(Typellerror) = P(X < 3| X ~ Bin(5, D) "m Y (5 (4) (ay I a 


The power of the test = 1 — a - ot 


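Example Let $X$ be a random variable with p.d.f.
\[ p(x,\theta) = \theta e^{-\theta x}, \quad x > 0,\ \theta > 0. \]
For testing $H : \theta = 2$ against $H_A : \theta = 1$ suppose $H$ is rejected if $x > 1$. Then

\[ P(\text{Type I error}) = P_H(X > 1) = \int_1^\infty 2e^{-2x}\,dx = \frac{1}{e^2} \]

\[ P(\text{Type II error}) = P_{H_A}(X \le 1) = \int_0^1 e^{-x}\,dx = 1 - \frac{1}{e} \]

The power of the test $= 1 - \left(1 - \frac{1}{e}\right) = \frac{1}{e}$.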
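The error probabilities in the exponential example follow from the tail formula P(X > c) = exp(-theta c) for this density. A minimal sketch is given below; the helper names `exp_tail` and `midpoint_integral` are hypothetical, and the midpoint-rule integration (with the upper limit truncated at 40) is included only as an independent cross-check of the closed form.

```python
from math import exp

def exp_tail(theta, c):
    """P(X > c) when X has density theta * exp(-theta * x), x > 0."""
    return exp(-theta * c)

def midpoint_integral(f, a, b, steps=20000):
    """Plain midpoint rule, used only to cross-check the closed-form tail probability."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

type1 = exp_tail(2.0, 1.0)          # reject when x > 1 although theta = 2
type2 = 1.0 - exp_tail(1.0, 1.0)    # accept when x <= 1 although theta = 1
power = 1.0 - type2

# Cross-check of the type I error by integrating the null density over (1, 40).
numeric_type1 = midpoint_integral(lambda x: 2.0 * exp(-2.0 * x), 1.0, 40.0)

print(type1, type2, power, numeric_type1)
```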

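Example Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from a $N(\theta,1)$ population.
For testing $H : \theta = \theta_0$ against $H_A : \theta = \theta_1\ (> \theta_0)$ suppose one rejects $H$ if $\bar{X} > \theta_0 + \frac{\tau_\alpha}{\sqrt{n}}$,
where $\tau_\alpha$ is the upper $100\alpha\%$ point of the $N(0,1)$ distribution. The power function of the
test is

\[ \beta_\omega(\theta) = P_\theta\left(\bar{X} > \theta_0 + \frac{\tau_\alpha}{\sqrt{n}}\right) = 1 - \Phi\left(\tau_\alpha + \sqrt{n}(\theta_0 - \theta)\right). \]

The power of the test at $\theta = \theta_1\ (> \theta_0)$ is equal to $\beta_\omega(\theta_1)$.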


Statistical Inference I 
Idea of Test function 
Module- 22 
Shirsendu Mukherjee 
Department of Statistics, Asutosh College, Kolkata, India. 


shirsendu_st@yahoo.co.in


In the previous module we have seen that when the sample size is fixed it is not possible
to minimize both the error probabilities at the same time. A test that minimizes
the probability of type-I error in fact maximizes the probability of type-II error. To
overcome this problem, Neyman and Pearson suggested that one may select a suitable
upper bound for the probability of type-I error and consider only those critical regions
for which the probability of type-I error does not exceed the selected upper
bound. Then, among all these critical regions, one selects the one for which the probability
of type-II error is the smallest. This upper bound is called the level of significance of
the test.

Level of Significance A test with critical region $\omega$ is called a level $\alpha$ test if
$P(X \in \omega \mid \theta) \le \alpha$ for all $\theta \in \Theta_H$, where $\alpha \in [0,1)$ is a preassigned quantity called the
level of significance of the test.

Note A level $\alpha$ test is also a level $\alpha_0$ test $(\alpha < \alpha_0 < 1)$. Here $\alpha$ and $\alpha_0$ are upper
bounds of the set $\{P(X \in \omega \mid \theta) : \theta \in \Theta_H\}$. It is quite natural to look for the least upper
bound of this set. This is called the size of the test.

Size of a test A test with critical region $\omega$ is called a size $\alpha$ test, $\alpha \in [0,1)$, if
$\sup_{\theta \in \Theta_H} P(X \in \omega \mid \theta) = \alpha$.

Note A size $\alpha$ test is also a level $\alpha$ test. A test with level $\alpha$ is easily attainable,
whereas it is not easy to attain the exact size $\alpha$ when the underlying distribution is
discrete.

Example Let $\theta$ be the probability of getting a head in a single toss of a coin. For
testing $H : \theta = \frac{1}{2}$ against $H_A : \theta > \frac{1}{2}$ suppose the coin is tossed 5 times and $H$ is
rejected if $X > c$, where $X$ denotes the number of heads obtained in 5 tosses and $c > 0$
is a constant. Here $P_H(X > 3) = 0.1875$ and $P_H(X > 4) = 0.03125$. If we take $c = 4$ then
the test is a level 0.05 test but not a size 0.05 test.

To attain the exact size we use a randomized test.


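Randomized test (Probabilistic test): It consists of determining a real-valued
function $\varphi(x) : \mathcal{X} \to [0,1]$ (i.e. $0 \le \varphi(x) \le 1\ \forall x \in \mathcal{X}$) such that, whenever $X = x$ is observed,
$H$ is rejected or accepted according as a success or a failure occurs on performing
a Bernoulli trial with probability of success equal to $\varphi(x)$. Here $\varphi$ is called the test or
critical function. It can be noted that a non-randomized test with critical region $\omega$ is a special choice of $\varphi$ where
\[ \omega = \{x : \varphi(x) = 1\}, \qquad \mathcal{X} - \omega = \{x : \varphi(x) = 0\}. \]
In general, a randomized test is of the following form:

\[ \varphi(x) = \begin{cases} 1, & \text{if } x \in E_1 \\ \gamma, & \text{if } x \in E_2 \\ 0, & \text{if } x \in E_3 \end{cases} \]

where the $E_i$'s are mutually disjoint subsets of $\mathcal{X}$ with $E_1 \cup E_2 \cup E_3 = \mathcal{X}$. The constants
$c > 0$ (the cut-off defining the sets $E_i$) and $\gamma \in (0,1)$ are so chosen that the test is of size $\alpha$.

In terms of the test statistic $T(X)$ a two-tailed randomized test is given by

\[ \varphi(x) = \begin{cases} 1, & \text{if } T(x) < c_l \text{ or } T(x) > c_u \\ 0, & \text{if } c_l < T(x) < c_u \end{cases} \]

with randomization at $T(x) = c_l$ or $T(x) = c_u$.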


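Example In the above example, if we want to get an exact size 0.05 test we set
$c = 4$. Then to determine $\gamma$ we use the equation
\[ P_H(X > 4) + \gamma P_H(X = 4) = 0.05, \quad \text{i.e.,} \quad 0.03125 + 0.15625\,\gamma = 0.05 \]
\[ \Rightarrow \gamma = 0.12. \]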
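The randomization constant of the coin-tossing example can be computed from the size equation. The following is a small sketch; the function names `binom_pmf` and `randomization_constant` are illustrative and not from the lecture.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def randomization_constant(n, p0, c, alpha):
    """Solve P(X > c) + gamma * P(X = c) = alpha for gamma under p = p0."""
    tail = sum(binom_pmf(k, n, p0) for k in range(c + 1, n + 1))
    boundary = binom_pmf(c, n, p0)
    return (alpha - tail) / boundary

n, p0, c, alpha = 5, 0.5, 4, 0.05
gamma = randomization_constant(n, p0, c, alpha)

# Size of the randomized test: reject if X > c, reject with probability gamma if X = c.
size = binom_pmf(5, n, p0) + gamma * binom_pmf(4, n, p0)
print("gamma =", gamma, " size =", size)
```

With gamma determined this way, the randomized test attains size 0.05 exactly, which the non-randomized test with c = 4 (size 0.03125) could not.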


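Power function
$\varphi(x)$ : Probability of rejection of $H$ when $X = x$ is observed.
Hence, the unconditional probability of rejection of $H$ is

\[ \beta_\varphi(\theta) = \int_{\mathcal{X}} \varphi(x)\,p(x,\theta)\,dx = E_\theta(\varphi(X)), \quad \theta \in \Theta. \quad (0.1) \]

Therefore,
\[ P(\text{Type I error}) = E_\theta(\varphi(X)), \quad \theta \in \Theta_H \]
\[ \Rightarrow \text{Size of } \varphi = \sup_{\theta \in \Theta_H} E_\theta[\varphi(X)]. \]
And when $H$ is simple, $\Theta_H = \{\theta_0\}$, the size is $E_{\theta_0}(\varphi(X))$.
\[ P(\text{Type II error}) = E_\theta(1 - \varphi(X)) = 1 - E_\theta(\varphi(X)) = 1 - \text{power of } \varphi \text{ at } \theta, \quad \theta \in \Theta_{H_A}. \quad (0.2) \]

In general, the power function of $\varphi$ is defined by
\[ \beta_\varphi(\theta) = E_\theta(\varphi(X)), \quad \theta \in \Theta. \]

$\varphi$ is a level $\alpha$ test if,

\[ E_\theta(\varphi(X)) \le \alpha, \ \forall \theta \in \Theta_H \]

\[ \Leftrightarrow \sup_{\theta \in \Theta_H} E_\theta\,\varphi(X) \le \alpha, \]

where $0 < \alpha < 1$.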
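Result. Let $\varphi_1$ and $\varphi_2$ be two level $\alpha$ tests for $(H, H_A)$. Then any convex combination
of $\varphi_1$ and $\varphi_2$ is also a level $\alpha$ test for $(H, H_A)$.

Proof. Take $0 < \lambda < 1$, and set
\[ \varphi(x) = \lambda\varphi_1(x) + (1-\lambda)\varphi_2(x). \]
Here, $\varphi : \mathcal{X} \to [0,1]$. Moreover,
\[ E_\theta(\varphi(X)) = \lambda E_\theta(\varphi_1(X)) + (1-\lambda)E_\theta(\varphi_2(X)) \le \lambda\alpha + (1-\lambda)\alpha, \ \forall \theta \in \Theta_H, \]

\[ \Rightarrow E_\theta(\varphi(X)) \le \alpha, \ \forall \theta \in \Theta_H. \]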


p-value. Instead of fixing the level of significance $\alpha$, one can report the level actually
attained by a test. This is defined by,

$l^+(x) = P_\theta\{T(X) \ge T(x)\}$ : Upper-tail test based on $T(x)$

$l^-(x) = P_\theta\{T(X) \le T(x)\}$ : Lower-tail test based on $T(x)$

$l(x) = 2\min(l^+(x), l^-(x))$ : Both-tailed test based on $T(x)$,

for $\theta \in \Theta_H$, provided the distribution of $T(X)$ remains the same over the null hypothetical
set $\Theta_H$. If we assume that the distribution of $T(X)$ is continuous, we have
\[ l^+(x) + l^-(x) = 1. \]

Moreover,

\[ l^-(X),\ l^+(X) \sim R(0,1) \text{ under } \theta \in \Theta_H. \]

[ Actually, $l^+(x) = 1 - F(T(x))$, $F$ being the d.f. of $T(X)$ under $H$. ]

A smaller value of $l^+$ (or $l^-$) gives stronger evidence in favor of the rejection of $H$
(or against $H$). Further it can be shown that,

\[ l(X) \sim R(0,1) \text{ under } H. \]

Again, if the distribution of $T(X)$ is discrete, we modify the levels actually attained
to

\[ l^+(x) = P_\theta\{T(X) > T(x)\} + \tfrac{1}{2}P_\theta\{T(X) = T(x)\} \]
\[ l^-(x) = P_\theta\{T(X) < T(x)\} + \tfrac{1}{2}P_\theta\{T(X) = T(x)\} \]

for $\theta \in \Theta_H$. Here $l^+(X)$ and $l^-(X)$ are approximately $R(0,1)$ under $H$. It is easy
to check that under the continuous setup,
\[ l(X) = 2\min(l^+(X), l^-(X)) \sim R(0,1) \text{ whenever } \theta \in \Theta_H. \]
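For a discrete statistic, the modified levels defined above can be computed by direct summation. The sketch below does this for a binomial count under a simple null; the function name `mid_p_values` and the numbers (4 heads in 5 tosses, null probability 0.5) are assumptions made for the illustration.

```python
from math import comb

def mid_p_values(t, n, p0):
    """Modified (mid-point) attained levels for an observed count t, X ~ Bin(n, p0):
    upper = P(X > t) + P(X = t)/2, lower = P(X < t) + P(X = t)/2."""
    pmf = [comb(n, k) * p0 ** k * (1.0 - p0) ** (n - k) for k in range(n + 1)]
    upper = sum(pmf[t + 1:]) + 0.5 * pmf[t]
    lower = sum(pmf[:t]) + 0.5 * pmf[t]
    two_sided = 2.0 * min(upper, lower)
    return upper, lower, two_sided

upper, lower, two_sided = mid_p_values(t=4, n=5, p0=0.5)
print("l+ =", upper, " l- =", lower, " l =", two_sided)
```

With this modification the two one-sided values add up to 1, just as in the continuous case, because the probability of the observed point is split equally between them.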


Some Examples 


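Example 1. $X_1, X_2, \ldots, X_n$ are iid $N(\mu, \sigma^2)$.

\[ H : \mu = 0 \quad \text{vs.} \quad H_A : \mu > 0 \]

(i) $\sigma = 1$ (known): Test statistic $= T(X) = \sqrt{n}\,\bar{X} \sim N(0,1)$ under $H : \mu = 0$.

(ii) $\sigma$ (unknown): Test statistic $= T(X) = \dfrac{\sqrt{n}\,\bar{X}}{S} \sim t_{n-1}$ under $H : \mu = 0$, where $S^2$ is the sample variance (with divisor $n-1$).

Suppose we want to test

\[ H : \sigma = 1 \quad \text{vs.} \quad H_A : \sigma > 1 \]

(iii) $\mu$ (known): Test statistic $= T(X) = \sum_{i=1}^{n}(X_i - \mu)^2 \sim \chi^2_n$ under $H$.

(iv) $\mu$ (unknown): Test statistic $= T(X) = \sum_{i=1}^{n}(X_i - \bar{X})^2 \sim \chi^2_{n-1}$ under $H$.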


Subject - Statistics 
Paper - Statistical Inference I 


Module - The Neyman-Pearson Fundamental Lemma |] 


Most Powerful Tests 


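Let $\mathcal{P}$ be a family of probability measures (defined on a measurable space $(\Omega, \mathcal{A})$),
indexed by a parameter $\theta \in \Theta \subset \mathbb{R}^d$, $d \in \{1, 2, 3, \ldots\}$.

\[ \mathcal{P} := \{P_\theta : P_\theta \text{ is a probability measure on } (\Omega, \mathcal{A}),\ \theta \in \Theta\} \]

Consider the problem of testing $H_0 : \theta \in \Theta_0$ against $H_1 : \theta \in \Theta_1$, where $\Theta_0 \cup \Theta_1 \subset \Theta$
and $\Theta_0 \cap \Theta_1 = \emptyset$. Let $M_\alpha$ be the class of all level $\alpha$ tests for $(H_0, H_1)$, $\alpha \in\ ]0,1[$.

\[ M_\alpha := \{\varphi : \varphi \text{ is a test function, } E_\theta[\varphi(X)] \le \alpha\ \ \forall \theta \in \Theta_0\} \]

Let $X := (X_1, X_2, \ldots, X_n)$ be an $n$-tuple of IID random variables (a random
sample), taking value in a set $\mathcal{X}\ (\subset \mathbb{R}^n)$, with joint density (with respect to some
$\sigma$-finite measure $\mu$) $p(x;\theta)$, $\theta \in \Theta$, corresponding to some $P_\theta \in \mathcal{P}$.
Definition: A level $\alpha$ test $\varphi_0\ (\in M_\alpha)$ is said to be Most Powerful (MP) for testing
$H_0 : \theta \in \Theta_0$ against $H_1 : \theta = \theta_1\ (\notin \Theta_0)$ if

\[ E_{\theta_1}[\varphi_0(X)] \ge E_{\theta_1}[\varphi(X)], \quad \forall \varphi \in M_\alpha. \]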


The Neyman-Pearson Fundamental Lemma 
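Theorem: [Existence] For testing $H_0 : \theta = \theta_0$ against $H_1 : \theta = \theta_1\ (\neq \theta_0)$ at some
level $\alpha$, there exists a test $\varphi_0$ given by

\[ \varphi_0(x) = \begin{cases} 0, & \text{if } p(x;\theta_1) < k \cdot p(x;\theta_0) \\ 1, & \text{if } p(x;\theta_1) > k \cdot p(x;\theta_0), \end{cases} \quad (1) \]

where $k\ (\ge 0)$ and $\varphi_0$ on the boundary $\{x : p(x;\theta_1) = k \cdot p(x;\theta_0)\}$ are so determined
that

\[ E_{\theta_0}[\varphi_0(X)] = \alpha. \quad (2) \]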
1Co-Ordinator: Dr. Shirsendu Mukherjee, Asutosh College, Kolkata 




[Sufficiency] Furthermore, if a test satisfies (1) and (2), for some constant $k\ (\ge 0)$,
then it is MP for $(H_0, H_1)$ at level $\alpha$.

[Necessity] Again, if $\varphi_0$ is MP at level $\alpha$ for $(H_0, H_1)$, then for some $k\ (\ge 0)$, it satisfies
(1) almost everywhere. It also satisfies (2), unless there exists a test of size less than
$\alpha$ and with power 1.

Proof: 

In this module, we shall prove the Existence and Sufficiency parts of the theorem; the
proof of Necessity requires some prerequisites, and will be taken up in the next module.
[Existence] We shall, first, show that a test satisfying (1) and (2) actually exists.
Define, $Y := p(X;\theta_1) \div p(X;\theta_0)$.

As $P_{\theta_0}\{p(X;\theta_0) > 0\} = 1$, so $P_{\theta_0}\{0 \le Y < \infty\} = 1$. In other words, $Y$ is well defined
under $H_0$.

Let

\[ \alpha(y) := P_{\theta_0}\{p(X;\theta_1) > y \cdot p(X;\theta_0)\} = P_{\theta_0}\{p(X;\theta_1) > y \cdot p(X;\theta_0),\ p(X;\theta_0) > 0\} = P_{\theta_0}\{Y > y\} =: 1 - G_{\theta_0}(y), \]

where $G_{\theta_0}(\cdot)$ is the CDF of $Y$ under $H_0$.

$G_{\theta_0}(\cdot)$ is non-decreasing and right-continuous, and hence, $\alpha(\cdot)$ is non-increasing and
right-continuous. Moreover, $\alpha(0-) = 1 - G_{\theta_0}(0-) = 1$ and $\alpha(\infty) = 1 - G_{\theta_0}(\infty) = 0$.
Hence, given $\alpha\ (\in\ ]0,1[)$, we can find $k\ (\ge 0)$ such that

\[ \alpha(k-) \ge \alpha \ge \alpha(k). \quad (3) \]

Here, two cases can arise:

• If $k$ is a point of continuity, we have $\alpha(k) = \alpha(k-) = \alpha$. Then, the test given
by (1) satisfies (2), regardless of the choice of $\varphi_0$ at the boundary.

• If $k$ is a point of discontinuity, we set $\varphi_0$ at the boundary as

\[ \gamma := \frac{\alpha - \alpha(k)}{\alpha(k-) - \alpha(k)}. \quad (4) \]

Thus,

\[ E_{\theta_0}[\varphi_0(X)] = P_{\theta_0}\{p(X;\theta_1) > k \cdot p(X;\theta_0)\} + \gamma \cdot P_{\theta_0}\{p(X;\theta_1) = k \cdot p(X;\theta_0)\} = \alpha(k) + \frac{\alpha - \alpha(k)}{\alpha(k-) - \alpha(k)}\,[\alpha(k-) - \alpha(k)] = \alpha. \]

So, a test of the form (1), satisfying (2), exists. [Q.E.D.]
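The construction in the Existence proof can be traced on a small discrete example. The sketch below assumes a single observation X ~ Bin(n, p) with simple hypotheses p = p0 against p = p1; the function name `neyman_pearson_binomial` and the numerical values are illustrative assumptions, not part of the lecture. It scans the distinct values of the likelihood ratio from the largest downwards until condition (3) holds, and then applies formula (4).

```python
from math import comb

def neyman_pearson_binomial(n, p0, p1, alpha):
    """Return (k, gamma) for the test: reject if ratio > k, reject w.p. gamma if ratio == k,
    where ratio(x) = p(x; p1) / p(x; p0) for X ~ Bin(n, p)."""
    pmf0 = [comb(n, x) * p0 ** x * (1 - p0) ** (n - x) for x in range(n + 1)]
    pmf1 = [comb(n, x) * p1 ** x * (1 - p1) ** (n - x) for x in range(n + 1)]
    ratio = [pmf1[x] / pmf0[x] for x in range(n + 1)]
    for k in sorted(set(ratio), reverse=True):
        above = sum(pmf0[x] for x in range(n + 1) if ratio[x] > k)   # alpha(k)
        at = sum(pmf0[x] for x in range(n + 1) if ratio[x] == k)     # alpha(k-) - alpha(k)
        if above <= alpha <= above + at:                              # condition (3)
            return k, (alpha - above) / at                            # formula (4)
    raise ValueError("no cut-off found; alpha must lie in (0, 1)")

k, gamma = neyman_pearson_binomial(n=5, p0=0.5, p1=0.75, alpha=0.05)
print("k =", k, " gamma =", gamma)
```

Because the likelihood ratio is increasing in x when p1 > p0, the resulting test rejects for large counts and randomizes at a single boundary count, matching the randomized coin-tossing test of the earlier module.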
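[Sufficiency] Our objective, here, is to show that any test of the form (1), satisfying
(2), is most powerful.

Pick any $\varphi \in M_\alpha$. Then, for any $x \in \mathcal{X}$, consider the product

\[ [\varphi_0(x) - \varphi(x)]\,[p(x;\theta_1) - k \cdot p(x;\theta_0)]. \]

By (1), when $p(x;\theta_1) > k \cdot p(x;\theta_0)$, then $\varphi_0(x) = 1 \ge \varphi(x)$. Similarly, when
$p(x;\theta_1) < k \cdot p(x;\theta_0)$, then $\varphi_0(x) = 0 \le \varphi(x)$. In both cases, the product
$[\varphi_0(x) - \varphi(x)]\,[p(x;\theta_1) - k \cdot p(x;\theta_0)]$ is non-negative. Of course, when $p(x;\theta_1) = k \cdot p(x;\theta_0)$,
the product is zero, and hence non-negative. Integrating over $\mathcal{X}$,

\[ \int_{\mathcal{X}} [\varphi_0(x) - \varphi(x)]\,[p(x;\theta_1) - k \cdot p(x;\theta_0)]\,dx \ge 0 \quad (5) \]

\[ \Leftrightarrow E_{\theta_1}[\varphi_0(X)] - E_{\theta_1}[\varphi(X)] \ge k \cdot \left(E_{\theta_0}[\varphi_0(X)] - E_{\theta_0}[\varphi(X)]\right) = k \cdot \left(\alpha - E_{\theta_0}[\varphi(X)]\right) \ge 0 \quad [\because E_{\theta_0}[\varphi(X)] \le \alpha] \]

\[ \Rightarrow E_{\theta_1}[\varphi_0(X)] \ge E_{\theta_1}[\varphi(X)], \quad \forall \varphi \in M_\alpha. \]

Thus, $\varphi_0$ is MP in its class. [Q.E.D.]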


Subject - Statistics 
Paper - Statistical Inference I 


Module - The Neyman-Pearson Fundamental Lemma |] 


Some Measure Theoretic Preliminaries 


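Let $\Omega$ be a set, and $\mathcal{A}$ be a $\sigma$-algebra of subsets of $\Omega$. The pair $(\Omega, \mathcal{A})$ is called a
measurable space. $X : \Omega \to \mathcal{X}\ (\subset \mathbb{R}^n)$ is an $\mathcal{A}/\mathcal{B}$-measurable map, where $\mathcal{B}$ is a
$\sigma$-algebra of subsets of $\mathcal{X}$, usually the $\sigma$-algebra of Borel subsets of $\mathcal{X}$. Clearly, $(\mathcal{X}, \mathcal{B})$
defines another measurable space. We define a measure $\mu$ on $(\mathcal{X}, \mathcal{B})$ by

\[ \mu(B) = \int_B dx, \]

where $B$ is a set in $\mathcal{B}$. The triple $(\mathcal{X}, \mathcal{B}, \mu)$ is called a measure space.
For a function $g(\cdot)$, defined on $\mathcal{X}$, we define the integral of $g(\cdot)$ over a set $B\ (\in \mathcal{B})$ as

\[ \int_B g(x)\,dx = \int_{\mathcal{X}} 1_B(x)\,g(x)\,dx, \]

where $1_B(x) := \begin{cases} 1, & \text{if } x \in B \\ 0, & \text{if } x \notin B \end{cases}$ is the indicator of $B$.

If $g(\cdot)$ is non-negative on $B$,

\[ \int_B g(x)\,dx = 0 \ \Rightarrow\ \text{either } g(x) = 0 \text{ on } B, \text{ or } \mu(B) = 0. \]

We say that $g(x) = 0$ almost everywhere (a.e.) on $B$, or more specifically, $g(x) = 0$
$\mu$-a.e. on $B$. That is, $g(x) = 0,\ \forall x \in B \setminus N$, where $\mu(N) = 0$ with $N \in \mathcal{B}$.

• If $g(\cdot)$ is non-negative a.e. on $B$ and $\int_B g(x)\,dx = 0$, then $g(x) = 0$ a.e. on $B$.

• If $g(\cdot)$ is positive a.e. on $B$ and $\int_B g(x)\,dx = 0$, then $\mu(B) = 0$.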




The Necessity Part of the Neyman-Pearson Lemma 
In the last module, we stated the Neyman-Pearson fundamental lemma, proved the
Existence and Sufficiency parts of the lemma, but stopped short of proving Necessity.
In this module, having briefly digressed to discuss some prerequisite measure theory,
we return to the lemma.
Before proceeding to the proof, we recapitulate the Necessity part of the lemma.
If a test $\varphi_0$ is MP at level $\alpha$ for testing $H_0 : \theta = \theta_0$ against $H_1 : \theta = \theta_1\ (\neq \theta_0)$, then
for some non-negative real number $k$, it satisfies

\[ \varphi_0(x) = \begin{cases} 0, & \text{if } p(x;\theta_1) < k \cdot p(x;\theta_0) \\ 1, & \text{if } p(x;\theta_1) > k \cdot p(x;\theta_0) \end{cases} \quad (1) \]

almost everywhere. It also satisfies
\[ E_{\theta_0}[\varphi_0(X)] = \alpha, \quad (2) \]

unless there exists a test of size less than $\alpha$ and with power 1.

Proof: 

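[Necessity] Here, we shall show that any MP test at level $\alpha$ for $(H_0, H_1)$ is, necessarily,
of the form (1), except on a set of $\mu$-measure 0.

As before, $\varphi_0$ is a test satisfying (1) and (2). Also, let $\varphi^*$ be MP at level $\alpha$ for $(H_0, H_1)$.
We have [cf. (5) and the argument leading up to it, in the proof of Sufficiency],

\[ [\varphi_0(x) - \varphi^*(x)]\,[p(x;\theta_1) - k \cdot p(x;\theta_0)] \ge 0, \quad \forall x \in \mathcal{X}. \]

Consider the sets:
\[ A_1 := \{x : \varphi_0(x) \neq \varphi^*(x)\} \quad \text{and} \quad A_2 := \{x : p(x;\theta_1) \neq k \cdot p(x;\theta_0)\}. \]

Then,
\[ \int_{\mathcal{X}} [\varphi_0(x) - \varphi^*(x)]\,[p(x;\theta_1) - k \cdot p(x;\theta_0)]\,dx = \int_{A_1 \cap A_2} [\varphi_0(x) - \varphi^*(x)]\,[p(x;\theta_1) - k \cdot p(x;\theta_0)]\,dx. \]

If $\mu(A_1 \cap A_2) > 0$, then this integral is strictly positive (the integrand is positive on $A_1 \cap A_2$), so that
$E_{\theta_1}[\varphi_0(X) - \varphi^*(X)] > k \cdot E_{\theta_0}[\varphi_0(X) - \varphi^*(X)] \ge 0$, i.e., $\varphi_0$ is more powerful than
$\varphi^*$, which is a contradiction. So, to get $E_{\theta_1}[\varphi_0(X) - \varphi^*(X)] = 0$, we must have
$\mu(A_1 \cap A_2) = 0$, i.e., $\varphi_0(x) = \varphi^*(x)$ $\mu$-a.e. on $A_2$.

In other words, MP tests are unique (of the form (1)), except on the set $\{x : p(x;\theta_1) =
k \cdot p(x;\theta_0)\}$. [Q.E.D.]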

Having shown how to 'build' a most powerful test, we prove, next, an important
property of such tests. In any reasonable test, we would want the power to be at least the
size, a condition technically called unbiasedness [not to be confused with the
notion of an unbiased estimator].

Theorem: Most powerful tests are unbiased.

Proof: Let $\varphi_0$ be a most powerful test at level $\alpha$. Take $\varphi(x) \equiv \alpha$. Obviously,
$\varphi \in M_\alpha$, and $E_\theta[\varphi(X)] = \alpha,\ \forall \theta \in \Theta$. As $\varphi_0$ is MP at level $\alpha$, so

\[ \text{Power}(\varphi_0) = E_{\theta_1}[\varphi_0(X)] \ge E_{\theta_1}[\varphi(X)] = \alpha = \text{Size}(\varphi_0). \quad \text{[Q.E.D.]} \]


Statistical Inference I 
Shirsendu Mukherjee 
Department of Statistics, Asutosh College, Kolkata, India. 
shirsendu_st@yahoo.co.in


Hypothesis Testing in Uniform $[0,\theta]$ - I


The Set-up 
Data: 


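$X_1, X_2, \ldots, X_n \sim \text{Unif}[0,\theta],\ \theta > 0$. The sample $X = (X_1, \ldots, X_n)$ takes value in $\mathcal{X} = \{x :=
(x_1, x_2, \ldots, x_n) : x_j \ge 0,\ \forall j\}$, which we call our sample space. $\theta$ is the
unknown parameter here. Our inferential procedures will focus on testing
hypotheses on $\theta$.

The p.d.f. of each $X_j$ is of the form

\[ f_\theta(x) = \begin{cases} \frac{1}{\theta}, & \text{if } 0 \le x \le \theta \\ 0, & \text{otherwise.} \end{cases} \quad (1) \]

The joint p.d.f. of $X_1, X_2, \ldots, X_n$ is

\[ p_\theta(x) = \prod_{j=1}^{n} f_\theta(x_j) = \begin{cases} \frac{1}{\theta^n}, & \text{if } x_{(n)} \le \theta \\ 0, & \text{otherwise,} \end{cases} \quad (2) \]

where $x_{(n)}$ is the maximum of the $x_j$'s. For $n = 2$, we may visualize the
sample space $\mathcal{X}$ as in figure 1.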

The Testing Problem 

Suppose we are interested in testing

\[ H_0 : \theta = \theta_0 \text{ (known) against } H_1 : \theta > \theta_0. \]

We start with
\[ H_0 : \theta = \theta_0 \text{ versus } H_1' : \theta = \theta_1\ (> \theta_0). \]

Our aim is to find a most powerful (MP) test for the problem of testing $H_0$
against $H_1'$. Figure 2 explains the hypotheses $H_0$ and $H_1'$.
Illustration of the Problem

Figure 2: $H_0$ vs. $H_1'$

The region in pink is the sample space under $H_0$. Mathematically, the
pink ($P$) and yellow ($Y$) zones can be identified as
\[ P := \{x : x_{(n)} \le \theta_0\}, \]

and,
\[ Y := \{x : \theta_0 < x_{(n)} \le \theta_1\}. \]


Derivation of MP Tests 


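\[ p_{\theta_1}(x) = \begin{cases} \frac{1}{\theta_1^n}, & \text{if } x_{(n)} \le \theta_1 \\ 0, & \text{if } x_{(n)} > \theta_1. \end{cases} \quad (3) \]
\[ k \cdot p_{\theta_0}(x) = \begin{cases} \frac{k}{\theta_0^n}, & \text{if } x_{(n)} \le \theta_0 \\ 0, & \text{if } x_{(n)} > \theta_0. \end{cases} \quad (4) \]
Take $k\ (> 0)$ such that
\[ \frac{1}{\theta_1^n} = \frac{k}{\theta_0^n} \ \Leftrightarrow\ k = \left(\frac{\theta_0}{\theta_1}\right)^n. \]

$p_{\theta_1}(x) > k \cdot p_{\theta_0}(x) \Leftrightarrow p_{\theta_1}(x) = \frac{1}{\theta_1^n}$ and $k \cdot p_{\theta_0}(x) = 0$, i.e., $\theta_0 < x_{(n)} \le \theta_1$.
The region in violet depicts what the inequalities above say.

$p_{\theta_1}(x) < k \cdot p_{\theta_0}(x) \Leftrightarrow p_{\theta_1}(x) = 0$ and $k \cdot p_{\theta_0}(x) = \frac{k}{\theta_0^n}$, i.e., $x_{(n)} > \theta_1$ and $x_{(n)} \le \theta_0$.
The regions in violet depict what the inequalities above say. Clearly, both
inequalities cannot, simultaneously, hold.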
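An MP Test: By the Neyman-Pearson lemma, an MP level $\alpha$ test for
testing $H_0$ against $H_1' : \theta = \theta_1$ is of the form
\[ \phi_0(x) = \begin{cases} 1, & \text{if } \theta_0 < x_{(n)} \le \theta_1 \\ \text{any value in } [0,1], & \text{if } x_{(n)} > \theta_1 \text{ or } x_{(n)} \le \theta_0, \end{cases} \]
such that
\[ E_{\theta_0}[\phi_0(X)] = \alpha. \]

We will look at some special tests of this form. The test $\phi_0$ is free to
take any value when $x_{(n)} > \theta_1$ or $x_{(n)} \le \theta_0$, as long as $E_{\theta_0}[\phi_0(X)] = \alpha$.
So, to make the test free of $\theta_1$, i.e., to make it UMP for testing $H_0$ against
$H_1 : \theta > \theta_0$, we write

\[ \phi_0(x) = \begin{cases} 1, & \text{if } x_{(n)} > \theta_0 \\ \text{any value in } [0,1], & \text{if } x_{(n)} \le \theta_0, \end{cases} \quad (5) \]

such that
\[ E_{\theta_0}[\phi_0(X)] = \alpha. \quad (6) \]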
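A Randomized MP Test: As a randomized UMP level $\alpha$ test for testing
$H_0$ against $H_1 : \theta > \theta_0$, we modify (5) into

\[ \phi_{01}(x) = \begin{cases} 1, & \text{if } x_{(n)} > \theta_0 \\ \alpha, & \text{if } x_{(n)} \le \theta_0. \end{cases} \quad (7) \]

Of course, $E_{\theta_0}[\phi_{01}(X)] = \alpha \cdot P_{\theta_0}\{X_{(n)} \le \theta_0\} = \alpha$.

Non-randomized MP Tests: An alternative to a randomized test is a non-randomized
level $\alpha$ UMP test.

We have to choose a subset $A_{\theta_0}$ (depending only on $\theta_0$) of the region
$\mathcal{X}_{\theta_0} := \{x : x_{(n)} \le \theta_0\}$ (on which to reject $H_0$), such that the size restriction
$E_{\theta_0}[\phi_0(X)] = \alpha$ holds.

Define,

\[ \phi_{02}(x) = \begin{cases} 1, & \text{if } x_{(n)} > \theta_0 \\ 1, & \text{if } x \in A_{\theta_0} \\ 0, & \text{if } x \in \mathcal{X}_{\theta_0} - A_{\theta_0}, \end{cases} \quad (8) \]

where $A_{\theta_0}$ is such that
\[ P_{\theta_0}\{X \in A_{\theta_0}\} = \alpha. \quad (9) \]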

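The test $\phi_{02}$ of the form (8), satisfying (9), can be diagrammatically elucidated
thus:

At level $\alpha$, $\phi_{02}$ rejects $H_0$ in the violet regions, and accepts in the pink
region. As $\phi_{02}$ is independent of the choice of $\theta_1$, as long as $\theta_1 > \theta_0$, this test
is, also, UMP for testing $H_0 : \theta = \theta_0$ against $H_1 : \theta > \theta_0$.

Choice of $A_{\theta_0}$: Take $A_{\theta_0} = \{x : x_{(n)} \le c\}$, for some constant $c\ (< \theta_0)$.
$c$ satisfies

\[ P_{\theta_0}\{X_{(n)} \le c\} = \left(\frac{c}{\theta_0}\right)^n = \alpha \ \Leftrightarrow\ c = \theta_0 \cdot \sqrt[n]{\alpha}. \quad (10) \]

Take $A_{\theta_0} = \{x : \kappa < x_{(n)} \le \theta_0\}$, for some constant $\kappa\ (< \theta_0)$.
$\kappa$ satisfies

\[ 1 - \left(\frac{\kappa}{\theta_0}\right)^n = \alpha \ \Leftrightarrow\ \kappa = \theta_0 \cdot \sqrt[n]{1-\alpha}. \quad (11) \]

Other Choices of $A_{\theta_0}$: Any subset of $\mathcal{X}_{\theta_0} = \{x : x_{(n)} \le \theta_0\}$, satisfying (9),
will serve as a potential $A_{\theta_0}$. A few other choices are given below: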


Statistical Inference I 
Shirsendu Mukherjee 
Department of Statistics, Asutosh College, Kolkata, India. 
shirsendu_st@yahoo.co.in 


Hypothesis Testing in Uniform $[0,\theta]$ - II


Power Function 
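Before computing the power functions for the tests, we note that, for $a \le \theta$,

\[ P_\theta\{X_{(n)} \le a\} = \left(\frac{a}{\theta}\right)^n, \quad (1) \]
\[ P_\theta\{X_{(n)} > a\} = 1 - \left(\frac{a}{\theta}\right)^n, \quad (2) \]
while, for $b > \theta$,
\[ P_\theta\{X_{(n)} \le b\} = 1, \quad (3) \]
\[ P_\theta\{X_{(n)} > b\} = 0. \quad (4) \]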

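Power Function for $\phi_{01}$: Consider the UMP level $\alpha$ randomized test $\phi_{01}$, given
by

\[ \phi_{01}(x) = \begin{cases} \alpha, & \text{if } x_{(n)} \le \theta_0 \\ 1, & \text{if } x_{(n)} > \theta_0. \end{cases} \quad (5) \]

The power function of this test is
\[ \beta_{\phi_{01}}(\theta) = E_\theta[\phi_{01}(X)] = \left[\alpha \times P_\theta\{X_{(n)} \le \theta_0\}\right] + \left[1 \times P_\theta\{X_{(n)} > \theta_0\}\right] \quad (6) \]

\[ = \begin{cases} \alpha, & \text{if } \theta \le \theta_0 \\ \alpha\left(\frac{\theta_0}{\theta}\right)^n + \left[1 - \left(\frac{\theta_0}{\theta}\right)^n\right], & \text{if } \theta > \theta_0 \end{cases} \]

\[ = \begin{cases} \alpha, & \text{if } \theta \le \theta_0 \\ 1 - (1-\alpha)\left(\frac{\theta_0}{\theta}\right)^n, & \text{if } \theta > \theta_0. \end{cases} \quad (7) \]

Clearly, $\beta_{\phi_{01}}(\theta)$ is non-decreasing in $\theta$, and $\beta_{\phi_{01}}(\theta) \le \alpha,\ \forall \theta \le \theta_0$. So, $\phi_{01}$ is UMP at level $\alpha$
for testing
\[ H_0^* : \theta \le \theta_0 \text{ against } H_1 : \theta > \theta_0, \]
as well.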

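Power Function for Tests of the form $\phi_{02}$: We shall, now, derive the power
function(s) for the test(s) given by (8) and (9) of the previous module:

\[ \phi_{02}(x) = \begin{cases} 1, & \text{if } x_{(n)} > \theta_0 \\ 1, & \text{if } x \in A_{\theta_0} \\ 0, & \text{if } x \in \mathcal{X}_{\theta_0} - A_{\theta_0}, \end{cases} \]

where $A_{\theta_0}$ is such that $P_{\theta_0}\{X \in A_{\theta_0}\} = \alpha$.
Such a test has been shown to be UMP for testing

\[ H_0 : \theta = \theta_0 \text{ against } H_1 : \theta > \theta_0. \]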


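We will look at the power function of the special forms of this test, already
discussed; viz., when $A_{\theta_0} = \{x : x_{(n)} \le c\}$, and when $A_{\theta_0} = \{x : \kappa < x_{(n)} \le \theta_0\}$.

Case when $A_{\theta_0} = \{x : \kappa\ (< \theta_0) < x_{(n)} \le \theta_0\}$:

We have seen that the size restriction forces $\kappa = \theta_0 \cdot \sqrt[n]{1-\alpha}$. Then the
test becomes
\[ \phi_{02}(x) = \begin{cases} 0, & \text{if } x_{(n)} \le \theta_0 \cdot \sqrt[n]{1-\alpha} \\ 1, & \text{if } x_{(n)} > \theta_0 \cdot \sqrt[n]{1-\alpha}. \end{cases} \quad (8) \]

The power function of this test is,

\[ \beta_{\phi_{02}}(\theta) = P_\theta\{X_{(n)} > \theta_0 \cdot \sqrt[n]{1-\alpha}\} = \begin{cases} 0, & \text{if } \theta \le \theta_0 \cdot \sqrt[n]{1-\alpha} \\ 1 - (1-\alpha)\cdot\left(\frac{\theta_0}{\theta}\right)^n, & \text{if } \theta > \theta_0 \cdot \sqrt[n]{1-\alpha}. \end{cases} \quad (9) \]

Clearly, by (9), $\beta_{\phi_{02}}(\theta)$ is non-decreasing in $\theta$, and $\beta_{\phi_{02}}(\theta) \le \alpha,\ \forall \theta \le \theta_0$. So, $\phi_{02}$ (as given by (8))
is UMP at level $\alpha$ for testing

\[ H_0^* : \theta \le \theta_0 \text{ against } H_1 : \theta > \theta_0. \]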


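Case when $A_{\theta_0} = \{x : x_{(n)} \le c\ (< \theta_0)\}$:

We have seen that the size restriction forces $c = \theta_0 \cdot \sqrt[n]{\alpha}$. Then the test
becomes

\[ \phi_{02}(x) = \begin{cases} 1, & \text{if } \{x_{(n)} \le \theta_0 \cdot \sqrt[n]{\alpha}\} \text{ or } \{x_{(n)} > \theta_0\} \\ 0, & \text{if } \theta_0 \cdot \sqrt[n]{\alpha} < x_{(n)} \le \theta_0. \end{cases} \quad (10) \]

The power function of this test is

\[ \beta_{\phi_{02}}(\theta) = \begin{cases} 1, & \text{if } \theta \le \theta_0 \cdot \sqrt[n]{\alpha} \\ \alpha \cdot \left(\frac{\theta_0}{\theta}\right)^n, & \text{if } \theta_0 \cdot \sqrt[n]{\alpha} < \theta \le \theta_0 \\ 1 - (1-\alpha)\cdot\left(\frac{\theta_0}{\theta}\right)^n, & \text{if } \theta > \theta_0. \end{cases} \quad (11) \]

$\beta_{\phi_{02}}(\theta)$ (as given by (11)) is U-shaped, with a minimum value of $\alpha$ at $\theta_0$.
Though the test is UMP for $(H_0, H_1)$, it is not UMP for $(H_0^*, H_1)$: its power exceeds $\alpha$
for $\theta < \theta_0$, so it is not even a level $\alpha$ test for $H_0^*$.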
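The three power functions derived in this module can be compared numerically. The sketch below encodes the closed forms (7), (9) and (11); the function names and the values n = 5, alpha = 0.05, theta0 = 1 are illustrative assumptions.

```python
def power_randomized(theta, theta0, n, alpha):
    """Power of the randomized test phi_01, formula (7)."""
    if theta <= theta0:
        return alpha
    return 1.0 - (1.0 - alpha) * (theta0 / theta) ** n

def power_upper_region(theta, theta0, n, alpha):
    """Power of phi_02 with A = {kappa < x_(n) <= theta0}, formula (9)."""
    kappa = theta0 * (1.0 - alpha) ** (1.0 / n)
    if theta <= kappa:
        return 0.0
    return 1.0 - (1.0 - alpha) * (theta0 / theta) ** n

def power_lower_region(theta, theta0, n, alpha):
    """Power of phi_02 with A = {x_(n) <= c}, formula (11)."""
    c = theta0 * alpha ** (1.0 / n)
    if theta <= c:
        return 1.0
    if theta <= theta0:
        return alpha * (theta0 / theta) ** n
    return 1.0 - (1.0 - alpha) * (theta0 / theta) ** n

n, alpha, theta0 = 5, 0.05, 1.0   # illustrative values
for theta in (0.5, 0.8, 1.0, 1.2, 1.5):
    print(theta,
          round(power_randomized(theta, theta0, n, alpha), 4),
          round(power_upper_region(theta, theta0, n, alpha), 4),
          round(power_lower_region(theta, theta0, n, alpha), 4))
```

All three tests have the same power for theta > theta0 and size alpha at theta0; they differ only for theta < theta0, where the test based on {x_(n) <= c} has power above alpha.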


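Comments on Test (10)

As we have seen, the test (10), though UMP (at level $\alpha$) for testing
\[ H_0 : \theta = \theta_0 \text{ against } H_1 : \theta > \theta_0, \]
is not UMP for testing
\[ H_0^* : \theta \le \theta_0 \text{ against } H_1 : \theta > \theta_0. \]

However, as we shall see later on, it is UMP for testing $H_0 : \theta = \theta_0$ against
$H_A : \theta \neq \theta_0$. As we will need this test later on, we assign an identifying name
to it: $\psi$. Thus, from (10),

\[ \psi(x) = \begin{cases} 0, & \text{if } \theta_0 \cdot \sqrt[n]{\alpha} < x_{(n)} \le \theta_0 \\ 1, & \text{if } x_{(n)} \le \theta_0 \cdot \sqrt[n]{\alpha} \text{ or } x_{(n)} > \theta_0. \end{cases} \quad (12) \]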


Statistical Inference I 
Shirsendu Mukherjee 
Department of Statistics, Asutosh College, Kolkata, India. 
shirsendu_st@yahoo.co.in


Hypothesis Testing in Uniform $[0,\theta]$ - III
Another Testing Problem 


Under the same set-up of $X_1, X_2, \ldots, X_n \sim \text{Unif}[0,\theta],\ \theta > 0$, we now
carry out a test for

\[ H_0 : \theta = \theta_0 \text{ (known) against } H_2 : \theta < \theta_0. \]

Start with
\[ H_0 : \theta = \theta_0 \text{ versus } H_2' : \theta = \theta_2\ (< \theta_0). \]
We aim to find a most powerful (MP) test for
the problem of testing $H_0$ against $H_2'$. The following figure explains the
hypotheses $H_0$ and $H_2'$. The region in yellow is the sample space under $H_2'$.

[Figure: sample space for $n = 2$, with axes $X_1$ and $X_2$.]

Mathematically, the pink ($P$) and yellow ($Y$) zones can be identified as
\[ Y := \{x : x_{(n)} \le \theta_2\}, \]

and
\[ P := \{x : \theta_2 < x_{(n)} \le \theta_0\}. \]


Derivation of MP Tests 


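\[ p_{\theta_2}(x) = \begin{cases} \frac{1}{\theta_2^n}, & \text{if } x_{(n)} \le \theta_2 \\ 0, & \text{if } x_{(n)} > \theta_2. \end{cases} \quad (1) \]

\[ k \cdot p_{\theta_0}(x) = \begin{cases} \frac{k}{\theta_0^n}, & \text{if } x_{(n)} \le \theta_0 \\ 0, & \text{if } x_{(n)} > \theta_0. \end{cases} \quad (2) \]

Take $k\ (> 0)$ such that

\[ \frac{1}{\theta_2^n} = \frac{k}{\theta_0^n} \ \Leftrightarrow\ k = \left[\frac{\theta_0}{\theta_2}\right]^n. \]

$p_{\theta_2}(x) > k \cdot p_{\theta_0}(x) \Leftrightarrow p_{\theta_2}(x) = \frac{1}{\theta_2^n}$ and $k \cdot p_{\theta_0}(x) = 0$, i.e., $x_{(n)} \le \theta_2$ and
$x_{(n)} > \theta_0$.

The regions in violet depict what the inequalities above say. Clearly, both
inequalities cannot, simultaneously, hold.

$p_{\theta_2}(x) < k \cdot p_{\theta_0}(x) \Leftrightarrow p_{\theta_2}(x) = 0$ and $k \cdot p_{\theta_0}(x) = \frac{k}{\theta_0^n}$, i.e., $\theta_2 < x_{(n)} \le \theta_0$.

The region in violet depicts what the inequalities above say.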


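An MP Test:

By the Neyman-Pearson lemma, an MP level $\alpha$ test for testing $H_0$
against $H_2' : \theta = \theta_2$ is of the form

\[ \phi^*(x) = \begin{cases} 0, & \text{if } \theta_2 < x_{(n)} \le \theta_0 \\ \text{any value in } [0,1], & \text{if } x_{(n)} > \theta_0 \text{ or } x_{(n)} \le \theta_2, \end{cases} \quad (3) \]

such that
\[ E_{\theta_0}[\phi^*(X)] = \alpha. \quad (4) \]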
Non-randomized MP Tests: 


To yield a non-randomized UMP test for testing Hp against H5: 0 < 9, 
we modify ¢* to (assume 05 > 0o - Ya) 


if bə < T(n) € Ao 


0 
$'(x)— 40, if 0o: Wa < Tin) € 0» (5) 
1, if £in) > 0o Or Tin) € 0o - Va. 


Combining the first two contingencies, we get 
0, if 0 a) = 0 
(ey af fec Va <a) S Bo 
1, if £n) > bo Or Tin) S bo: Ya, 


which is the same thing as 7, as defined in (12). 
Now, we show that w is indeed UMP for testing Ho against H5: 0 < 6p. 
The test 1» (as given by (12)) 


0, if0g- Va < Tm <9 
y(x) = ee RV. (7) 
1, if tm) > % Or Tm) € 0o - Wa 


(6) 


3 


is independent of $\theta_2$, and gives a UMP test at level $\alpha$ for testing $H_0$ against
$H_3 : \theta_0\sqrt[n]{\alpha} < \theta < \theta_0$. For $\theta_2 \le \theta_0\sqrt[n]{\alpha}$, note that $\psi$ has power 1 (the maximum)
for every $\theta \le \theta_0\sqrt[n]{\alpha}$, and is, hence, UMP for testing $H_0$ against
$H_4 : \theta \le \theta_0\sqrt[n]{\alpha}$ as well. It is possible to identify a constant $k$ such that an application
of the Neyman-Pearson lemma gives the UMP test for testing $H_0$ against $H_4$
without any reference to the power function. The question is: can we guess
what that constant is?

Thus, $\psi$ is UMP at level $\alpha$ for testing $H_0 : \theta = \theta_0$ against both
$H_3 : \theta_0\sqrt[n]{\alpha} < \theta < \theta_0$ and $H_4 : \theta \le \theta_0\sqrt[n]{\alpha}$. It is, therefore, UMP at level $\alpha$ for
testing $H_0 : \theta = \theta_0$ against $H_2 : \theta < \theta_0$.

UMP test for $H_0$ versus $H_A : \theta \ne \theta_0$:

As it is also UMP at level $\alpha$ for testing
$H_0 : \theta = \theta_0$ against $H_1 : \theta > \theta_0$,
we conclude that $\psi$ is UMP at level $\alpha$ for testing
$H_0 : \theta = \theta_0$ against $H_A : \theta \ne \theta_0$.
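
The test derived above for Unif$[0, \theta]$ rejects when the sample maximum exceeds $\theta_0$ or falls at or below $\theta_0 \sqrt[n]{\alpha}$. The following is a minimal Python sketch of that rule together with its exact power function, obtained from $P_\theta(X_{(n)} \le t) = (t/\theta)^n$ for $0 \le t \le \theta$. The function names `ump_uniform_test` and `power_uniform_test` are illustrative choices, not part of the notes.

```python
def ump_uniform_test(sample, theta0, alpha):
    """Return 1 (reject H0: theta = theta0) or 0 (accept) for a Unif[0, theta] sample."""
    n = len(sample)
    x_max = max(sample)
    lower = theta0 * alpha ** (1.0 / n)  # theta0 times the n-th root of alpha
    return 1 if (x_max > theta0 or x_max <= lower) else 0


def power_uniform_test(theta, theta0, alpha, n):
    """Exact rejection probability of the test when the true parameter is theta."""
    lower = theta0 * alpha ** (1.0 / n)

    def cdf_max(t):
        # P(X_(n) <= t) = (t / theta)^n for 0 <= t <= theta, and 1 for t > theta
        return min(t / theta, 1.0) ** n

    # P(X_(n) > theta0) + P(X_(n) <= theta0 * alpha^(1/n))
    return (1.0 - cdf_max(theta0)) + cdf_max(lower)
```

At $\theta = \theta_0$ the power equals $\alpha$, for $\theta \le \theta_0\sqrt[n]{\alpha}$ it equals 1, and for $\theta > \theta_0$ it equals $1 - (1 - \alpha)(\theta_0/\theta)^n$.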


Final Comments:
In the case of Unif$[0, \theta]$, for the problem of testing

$$H_0 : \theta = \theta_0 \text{ against } H_A : \theta \ne \theta_0$$

at level $\alpha$, we have obtained a UMP test. This, however, is a rare instance of
the situation where a UMP test for $(H_0, H_A)$ exists. Another situation where
a UMP test for $(H_0, H_A)$ exists is the exponential Exp$(\theta, 1)$ distribution,
with p.d.f.

$$f_\theta(x) = \begin{cases} e^{-(x - \theta)}, & \text{if } x > \theta \\ 0, & \text{otherwise.} \end{cases} \tag{8}$$

If, somehow, the Exp$(\theta, 1)$ distribution can be transformed to Unif$[0, \theta']$,
our problem would reduce to the case discussed. Can you find some function
$g$ such that, for $X \sim$ Exp$(\theta, 1)$, $g(X) \sim$ Unif$[0, \theta']$?


Statistical Inference I 
Hypothesis testing in Shifted Exponential Population 
Module- 29 


In the last three modules we have discussed the UMP test relating to the $R(0, \theta)$ population.
We have considered both one-sided and two-sided alternatives. In the present
module we are going to discuss the UMP test for shifted exponential distributions.
We consider both the one-parameter and two-parameter cases. At first we discuss the
single-parameter case, where we use the UMP test derived for the $R(0, \theta)$ population.
One-parameter exponential distribution 


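Let $X = (X_1, \ldots, X_n)$ be a random sample of size $n$ from the population with p.d.f.

$$f(x, \mu) = \begin{cases} \exp\{-(x - \mu)\}, & x > \mu \\ 0, & \text{otherwise,} \end{cases}$$

with $-\infty < \mu < \infty$. The joint density of $X_1, X_2, \ldots, X_n$ is given as

$$p(x, \mu) = \begin{cases} \exp\left\{-\sum_{i=1}^n (x_i - \mu)\right\}, & x_i > \mu\ \forall i \\ 0, & \text{otherwise.} \end{cases}$$

Define $x_{(1)}$ as the smallest order statistic and notice that $x_i > \mu\ \forall i \iff x_{(1)} > \mu$.
Hence the above joint density can be rewritten as

$$p(x, \mu) = \begin{cases} \exp\left\{-\sum_{i=1}^n (x_i - \mu)\right\}, & x_{(1)} > \mu \\ 0, & \text{otherwise.} \end{cases}$$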


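Suppose our object is to find the UMP test for

$$H_0 : \mu = \mu_0 \text{ vs. } H_A : \mu \ne \mu_0.$$

Consider the transformation
$$Y_i = e^{-X_i}, \quad i = 1, 2, \ldots, n.$$

Then $Y_i$, $i = 1, 2, \ldots, n$, are independently distributed $R(0, \theta)$ random variables,
where $\theta = e^{-\mu}$. Based on the observations $Y_i$, $i = 1, 2, \ldots, n$, the UMP size $\alpha$ test for
$H_0 : \theta = \theta_0$ vs. $H_A : \theta \ne \theta_0$, with $\theta_0 = e^{-\mu_0}$, is given by

$$\phi(Y) = \begin{cases} 0, & \text{if } e^{-\mu_0}\alpha^{1/n} < Y_{(n)} \le e^{-\mu_0} \\ 1, & \text{otherwise.} \end{cases}$$

Since $Y_{(n)} = e^{-X_{(1)}}$, in terms of the observations $X_i$, $i = 1, 2, \ldots, n$, the UMP test is given by

$$\phi(X) = \begin{cases} 0, & \text{if } \mu_0 \le X_{(1)} < \mu_0 - \frac{1}{n}\log\alpha \\ 1, & \text{otherwise.} \end{cases}$$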
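
As a rough check of the one-parameter result, the sketch below codes the acceptance region $\mu_0 \le X_{(1)} < \mu_0 - \frac{1}{n}\log\alpha$ and the exact power function, using $P_\mu(X_{(1)} > t) = e^{-n(t - \mu)}$ for $t \ge \mu$. The names `ump_shifted_exp_test` and `power_shifted_exp_test` are hypothetical helpers introduced only for this illustration.

```python
import math


def ump_shifted_exp_test(sample, mu0, alpha):
    """Return 1 (reject H0: mu = mu0) or 0 (accept) for a shifted exponential sample."""
    n = len(sample)
    x_min = min(sample)
    upper = mu0 - math.log(alpha) / n  # mu0 - (1/n) log(alpha), which exceeds mu0
    return 0 if (mu0 <= x_min < upper) else 1


def power_shifted_exp_test(mu, mu0, alpha, n):
    """Exact rejection probability when the true location is mu."""
    upper = mu0 - math.log(alpha) / n

    def surv_min(t):
        # P(X_(1) > t) = exp(-n (t - mu)) for t > mu, and 1 otherwise
        return math.exp(-n * (t - mu)) if t > mu else 1.0

    # reject when X_(1) < mu0 or X_(1) >= upper
    return (1.0 - surv_min(mu0)) + surv_min(upper)
```

Under $H_0$ the first term vanishes and the second equals $e^{\log\alpha} = \alpha$, so the test has size $\alpha$.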


Two-parameter exponential distribution 


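Let $X = (X_1, \ldots, X_n)$ be a random sample of size $n$ from the population with p.d.f.

$$f(x, \mu, \sigma) = \begin{cases} \dfrac{1}{\sigma}\exp\left\{-\dfrac{x - \mu}{\sigma}\right\}, & x > \mu \\ 0, & \text{otherwise,} \end{cases}$$

with $-\infty < \mu < \infty$ and $\sigma > 0$. Defining $\theta = (\mu, \sigma)$, the joint density is given as

$$p(x, \theta) = \begin{cases} \dfrac{1}{\sigma^n}\exp\left\{-\dfrac{1}{\sigma}\sum_{i=1}^n (x_i - \mu)\right\}, & x_i > \mu\ \forall i \\ 0, & \text{otherwise.} \end{cases}$$

Therefore the parametric space is $\Theta = (-\infty, \infty) \times (0, \infty)$. Consider the following testing problems:

I. $H_0 : \mu = \mu_0, \sigma = \sigma_0$ vs. $H_A : \mu < \mu_0, \sigma < \sigma_0$

II. $H_0 : \mu = \mu_0, \sigma = \sigma_0$ vs. $H_A : \mu > \mu_0, \sigma < \sigma_0$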


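Solution (I). Since $x_i > \mu\ \forall i \iff x_{(1)} > \mu$, we have $p(x, \theta) = 0$ if $x_{(1)} \le \mu$. Taking
$\theta_1 = (\mu_1, \sigma_1)$ with $\mu_1 < \mu_0$, $\sigma_1 < \sigma_0$, and $\theta_0 = (\mu_0, \sigma_0)$, we have

$$p(x, \theta_1) > 0 \text{ if } x_{(1)} > \mu_1, \qquad p(x, \theta_1) = 0 \text{ if } x_{(1)} \le \mu_1,$$
$$p(x, \theta_0) > 0 \text{ if } x_{(1)} > \mu_0, \qquad p(x, \theta_0) = 0 \text{ if } x_{(1)} \le \mu_0.$$

Now whenever $x_{(1)} > \mu_0$ we have $p(x, \theta_1),\ p(x, \theta_0) > 0$. Thus we get

$$\frac{p(x, \theta_1)}{p(x, \theta_0)} = \begin{cases} \text{finite}, & \text{if } x_{(1)} > \mu_0 \\ \infty, & \text{if } \mu_1 < x_{(1)} \le \mu_0. \end{cases}$$

Hence, for testing against $H_A' : \mu = \mu_1\ (< \mu_0),\ \sigma = \sigma_1\ (< \sigma_0)$, the MP size $\alpha$ test has critical
region

$$w = \left\{x : x_{(1)} \le \mu_0, \text{ or } x_{(1)} > \mu_0 \text{ and } \frac{p(x, \mu_1, \sigma_1)}{p(x, \mu_0, \sigma_0)} > k\right\},$$

where $k\ (> 0)$ is such that $P_{\theta_0}(w) = \alpha$. Now,

$$\frac{p(x, \mu_1, \sigma_1)}{p(x, \mu_0, \sigma_0)} = \left(\frac{\sigma_0}{\sigma_1}\right)^n \exp\left\{-\left(\frac{1}{\sigma_1} - \frac{1}{\sigma_0}\right)\sum_{i=1}^n x_i\right\} \exp\left\{n\left(\frac{\mu_1}{\sigma_1} - \frac{\mu_0}{\sigma_0}\right)\right\},$$

where $x_{(1)} > \mu_0$. The above ratio is non-increasing in $\sum_{i=1}^n x_i$ so long as $\sigma_1 < \sigma_0$.
Hence there is a constant $c$ such that the ratio exceeds $k$ if and only if $\sum_{i=1}^n x_i < c$.
Thus $w$ is equivalent to

$$w = \left\{x : x_{(1)} \le \mu_0, \text{ or } x_{(1)} > \mu_0 \text{ and } \sum_{i=1}^n x_i < c\right\},$$

where $c$ is such that

$$P_{\theta_0}\{X_{(1)} \le \mu_0\} + P_{\theta_0}\left\{X_{(1)} > \mu_0 \text{ and } \sum_{i=1}^n X_i < c\right\} = \alpha.$$

Since $P_{\theta_0}\{X_{(1)} \le \mu_0\} = 0$, we get

$$\int_{x_{(1)} > \mu_0,\ \sum x_i < c} p(x, \mu_0, \sigma_0)\,dx = \alpha, \quad \text{i.e.,} \quad \int_{\sum x_i < c} p(x, \mu_0, \sigma_0)\,dx = \alpha.$$

If we consider the transformation $Y = \frac{2}{\sigma_0}\sum_{i=1}^n (X_i - \mu_0)$, then $Y \sim \chi^2_{2n}$ under $H_0$. Hence

$$P\left[\chi^2_{2n} < \frac{2(c - n\mu_0)}{\sigma_0}\right] = \alpha,$$

and if we denote the lower $\alpha$ point of $\chi^2_{2n}$ by $\chi^2_{2n,\alpha}$, then $c = \frac{\sigma_0 \chi^2_{2n,\alpha}}{2} + n\mu_0$.

Therefore the MP test is given by

$$w = \left\{x : x_{(1)} \le \mu_0, \text{ or } x_{(1)} > \mu_0 \text{ and } \sum_{i=1}^n x_i < \frac{\sigma_0 \chi^2_{2n,\alpha}}{2} + n\mu_0\right\},$$

which is independent of any $(\mu_1, \sigma_1)$ with $\mu_1 < \mu_0$, $\sigma_1 < \sigma_0$. Hence it is a UMP test as
well.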
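
The cut-off $c = \sigma_0 \chi^2_{2n,\alpha}/2 + n\mu_0$ can be evaluated without statistical tables, because a chi-square variable with an even number $2n$ of degrees of freedom has the closed-form distribution function $1 - e^{-y/2}\sum_{j=0}^{n-1} (y/2)^j / j!$. The sketch below inverts that function by bisection. All function names are assumptions made for the illustration, and the bisection is only one of several ways to obtain the quantile.

```python
import math


def chi2_cdf_even_df(y, n):
    """CDF at y of a chi-square variable with 2n degrees of freedom."""
    if y <= 0:
        return 0.0
    half = y / 2.0
    term, total = 1.0, 1.0
    for j in range(1, n):
        term *= half / j  # (y/2)^j / j!
        total += term
    return 1.0 - math.exp(-half) * total


def chi2_lower_point_even_df(alpha, n):
    """Lower alpha point of chi-square with 2n degrees of freedom, by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf_even_df(hi, n) < alpha:
        hi *= 2.0  # expand the bracket until it contains the quantile
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf_even_df(mid, n) < alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def critical_value_c(mu0, sigma0, alpha, n):
    """c = sigma0 * (lower alpha point of chi-square_{2n}) / 2 + n * mu0."""
    return sigma0 * chi2_lower_point_even_df(alpha, n) / 2.0 + n * mu0
```

For $n = 1$ the quantile reduces to $-2\log(1 - \alpha)$, which gives a quick sanity check of the routine.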
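Solution (II). We take $\theta_1 = (\mu_1, \sigma_1)$ with $\mu_1 > \mu_0$, $\sigma_1 < \sigma_0$. Thus we get

$$\frac{p(x, \theta_1)}{p(x, \theta_0)} = \begin{cases} \text{finite and positive}, & \text{if } x_{(1)} > \mu_1 \\ 0, & \text{if } \mu_0 < x_{(1)} \le \mu_1. \end{cases}$$

Hence, for testing against $H_A' : \mu = \mu_1\ (> \mu_0),\ \sigma = \sigma_1\ (< \sigma_0)$, the MP size $\alpha$ test has critical
region

$$w = \left\{x : x_{(1)} > \mu_1 \text{ and } \frac{p(x, \theta_1)}{p(x, \theta_0)} > k\right\},$$

where $k\ (> 0)$ is such that $P_{\theta_0}(w) = \alpha$. Now, as in case (I), the ratio is non-increasing in $\sum_{i=1}^n x_i$
on the set $\{x_{(1)} > \mu_1\}$. Thus $w$ is equivalent to

$$w = \left\{x : x_{(1)} > \mu_1 \text{ and } \sum_{i=1}^n x_i < c\right\},$$

where $c$ is such that

$$P_{\theta_0}\left\{X_{(1)} > \mu_1 \text{ and } \sum_{i=1}^n X_i < c\right\} = \alpha.$$

Under $\theta_0$, $P_{\theta_0}\{X_{(1)} > \mu_1\} = e^{-n(\mu_1 - \mu_0)/\sigma_0}$ and, given $X_{(1)} > \mu_1$, the lack-of-memory
property of the exponential distribution makes $\frac{2}{\sigma_0}\sum_{i=1}^n (X_i - \mu_1)$ a $\chi^2_{2n}$ variable. Hence

$$e^{-n(\mu_1 - \mu_0)/\sigma_0}\, P\left[\chi^2_{2n} < \frac{2(c - n\mu_1)}{\sigma_0}\right] = \alpha,$$

so that, writing $\chi^2_{2n,p}$ for the lower $p$ point of $\chi^2_{2n}$,

$$c = n\mu_1 + \frac{\sigma_0}{2}\,\chi^2_{2n,\ \alpha e^{-n(\mu_0 - \mu_1)/\sigma_0}}.$$

Since the non-randomized MP test depends upon the specific alternative value $\mu_1$, it is
not UMP for testing $H_0 : \mu = \mu_0, \sigma = \sigma_0$ vs. $H_A : \mu > \mu_0, \sigma < \sigma_0$. Hence no UMP
test exists for this problem.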


Statistical Inference I 
Testing of Composite Null Hypotheses against Simple 
Alternatives 
Module- 30 


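Let $\{p(x, \theta), \theta \in \Theta\}$, $\Theta \subseteq \mathbb{R}^p$, $p \ge 2$. Testing problem:

$$H : \theta \in \Theta_H \text{ vs. } H_A : \theta = \theta_1 \text{ (known)} \notin \Theta_H.$$

Define

$$M(\alpha, H) : \text{class of all level } \alpha \text{ tests for testing } H = \{\phi : E_\theta \phi(X) \le \alpha\ \forall \theta \in \Theta_H\}.$$

Let $w(\theta)$ be a distribution of $\theta$ over $\Theta_H$. $w(\theta)$ is known, $w(\theta) \ge 0\ \forall \theta \in \Theta_H$, and

$$\int_{\Theta_H} w(\theta)\,d\mu(\theta) = 1 \quad [\mu(\theta) : \text{measure on } \Theta_H].$$

$p(x, \theta)$ : p.d.f./p.m.f. of $X$ given $\theta$.

$p(x, \theta)w(\theta)$ : joint p.d.f./p.m.f. of $(X, \theta)$ over $\mathcal{X} \times \Theta_H$.

The marginal distribution of $X$ is

$$g_w(x) = \int_{\Theta_H} p(x, \theta)w(\theta)\,d\mu(\theta), \quad x \in \mathcal{X},$$

which is known when $w(\theta)$ is known a priori.

Next consider the testing problem

$$H_w : X \sim g_w(x) \text{ vs. } H_A : \theta = \theta_1.$$

Here $H_w$ and $H_A$ are both simple. Therefore, by the NP lemma, the MP test for $H_w$ vs. $H_A$
is defined by the function

$$\phi_w(x) = \begin{cases} 1, & \text{if } p(x, \theta_1) > k\,g_w(x) \\ a, & \text{if } p(x, \theta_1) = k\,g_w(x) \\ 0, & \text{if } p(x, \theta_1) < k\,g_w(x), \end{cases}$$

with $E_{H_w}\phi_w(X) = \alpha$. Suppose such a test satisfies

$$E_\theta \phi_w(X) \le \alpha \quad \forall \theta \in \Theta_H.$$

Then such a $\phi_w$ is MP size $\alpha$ for $(H, H_A)$. To see this, define

$$M(\alpha, H_w) : \text{class of all level } \alpha \text{ tests for testing } H_w = \{\phi : E_{H_w}\phi(X) \le \alpha\}.$$

It can be shown that

$$M(\alpha, H) \subseteq M(\alpha, H_w).$$

Proof. Let $\phi \in M(\alpha, H)$, so that $E_\theta\phi(X) \le \alpha$ for all $\theta \in \Theta_H$. Then

$$\int_{\Theta_H}\left[\int_{\mathcal{X}} \phi(x)p(x, \theta)\,dx\right] w(\theta)\,d\mu(\theta) \le \alpha$$

$$\Rightarrow \int_{\mathcal{X}} \phi(x)\left[\int_{\Theta_H} p(x, \theta)w(\theta)\,d\mu(\theta)\right] dx \le \alpha \quad \text{(by Fubini's theorem)}$$

$$\Rightarrow \int_{\mathcal{X}} \phi(x)g_w(x)\,dx \le \alpha \ \Rightarrow\ E_{H_w}\phi(X) \le \alpha.$$

Conclusions:
(i) $\phi_w$ : MP size $\alpha$ for $(H_w, H_A)$
(ii) $E_\theta\phi_w(X) \le \alpha\ \forall \theta \in \Theta_H$
(iii) $M(\alpha, H) \subseteq M(\alpha, H_w)$

To conclude that $\phi_w$ is MP size $\alpha$ for $(H, H_A)$, we can use the following argument.
$\phi_w$ is MP within $M(\alpha, H_w)$ (by (i)), so

$$E_{\theta_1}\phi_w(X) \ge E_{\theta_1}\phi(X) \quad \forall \phi \in M(\alpha, H_w),$$

and hence, by (iii), for all $\phi \in M(\alpha, H)$. Again, by (ii), $\phi_w \in M(\alpha, H)$. Therefore $\phi_w$ is MP for $(H, H_A)$ within $M(\alpha, H)$.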
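Example. Let $X = (X_1, \ldots, X_n)$ with $X_i \sim N(\mu, \sigma^2)$, $i = 1, \ldots, n$, independently, and
$\theta = (\mu, \sigma^2) \in \mathbb{R} \times \mathbb{R}^+ \subset \mathbb{R}^2$. Testing problems:

$$H_1 : [\sigma^2 \ge \sigma_0^2,\ -\infty < \mu < \infty] \text{ vs. } H_A : [\sigma^2 = \sigma_1^2\ (< \sigma_0^2),\ \mu = \mu_1]$$

$$H_2 : [\sigma^2 \le \sigma_0^2,\ -\infty < \mu < \infty] \text{ vs. } H_A : [\sigma^2 = \sigma_1^2\ (> \sigma_0^2),\ \mu = \mu_1]$$

Solution: A sufficient statistic for $(\mu, \sigma^2)$ is $T = \left(\bar{X}, \sum_{i=1}^n (X_i - \bar{X})^2\right) = (U, V)$.
So a test based on $X = (X_1, \ldots, X_n)$ has the same power as one based on $(u, v)$.
For each test $\phi = \phi(X_1, \ldots, X_n)$, we can find another test $\psi = \psi(u, v) = \psi(T)$
such that $\phi$ and $\psi$ have identical power functions. So, WLOG, we consider $\phi$ to be
dependent on $T$, if required. Let $w(\theta)$ be a non-negative weight function on $\sigma^2 = \sigma_0^2$,
i.e., $w(\theta) = w(\mu, \sigma_0^2)$, $-\infty < \mu < \infty$, with the property that

(i) $w(\mu, \sigma_0^2) \ge 0\ \forall \mu$

(ii) $\int_{-\infty}^{\infty} w(\mu, \sigma_0^2)\,d\mu = 1$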
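Here we have the following likelihood functions:

$$p(x, \theta) = c\,\sigma^{-n}\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\right] = c\,\sigma^{-n}\exp\left[-\frac{v}{2\sigma^2}\right]\exp\left[-\frac{n}{2\sigma^2}(u - \mu)^2\right].$$

We know $U$ and $V$ are independently distributed, where $U \sim N\left(\mu, \frac{\sigma^2}{n}\right)$ and $\frac{V}{\sigma^2} \sim \chi^2_{n-1}$, so that

$$p(u, v, \theta) = c_1\,\sigma^{-n}v^{\frac{n-3}{2}}\exp\left[-\frac{v}{2\sigma^2}\right]\exp\left[-\frac{n}{2\sigma^2}(u - \mu)^2\right].$$

Since $H_w : X \sim g_w(x)$, where $g_w(x) = \int_{-\infty}^{\infty} p(x, \theta)w(\mu, \sigma_0^2)\,d\mu$, therefore

$$p(u, v \mid H_w) = c_1\,\sigma_0^{-n}v^{\frac{n-3}{2}}\exp\left[-\frac{v}{2\sigma_0^2}\right]\int_{-\infty}^{\infty}\exp\left[-\frac{n}{2\sigma_0^2}(u - \mu)^2\right]w(\mu, \sigma_0^2)\,d\mu, \tag{0.1}$$

which is known. Also,

$$p(x, \theta_1) = p(u, v, \theta_1) = c_1\,\sigma_1^{-n}v^{\frac{n-3}{2}}\exp\left[-\frac{v}{2\sigma_1^2}\right]\exp\left[-\frac{n}{2\sigma_1^2}(u - \mu_1)^2\right].$$

Hence the likelihood ratio is given by

$$R = \frac{c_1\,\sigma_1^{-n}v^{\frac{n-3}{2}}\exp\left[-\frac{v}{2\sigma_1^2}\right]\exp\left[-\frac{n}{2\sigma_1^2}(u - \mu_1)^2\right]}{c_1\,\sigma_0^{-n}v^{\frac{n-3}{2}}\exp\left[-\frac{v}{2\sigma_0^2}\right]\int_{-\infty}^{\infty}\exp\left[-\frac{n}{2\sigma_0^2}(u - \mu)^2\right]w(\mu, \sigma_0^2)\,d\mu}.$$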


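Consider the following weight function, which puts all its mass at $\mu = \mu_1$:

$$w(\mu, \sigma^2) = \begin{cases} 1, & \text{if } (\mu, \sigma^2) = (\mu_1, \sigma_0^2) \\ 0, & \text{otherwise.} \end{cases}$$

Then $R$ boils down to

$$R = \frac{\sigma_1^{-n}\exp\left[-\frac{v}{2\sigma_1^2}\right]\exp\left[-\frac{n}{2\sigma_1^2}(u - \mu_1)^2\right]}{\sigma_0^{-n}\exp\left[-\frac{v}{2\sigma_0^2}\right]\exp\left[-\frac{n}{2\sigma_0^2}(u - \mu_1)^2\right]} = \left(\frac{\sigma_0}{\sigma_1}\right)^n\exp\left[\frac{1}{2}\left(\frac{1}{\sigma_0^2} - \frac{1}{\sigma_1^2}\right)\sum_{i=1}^n (x_i - \mu_1)^2\right],$$

since $v + n(u - \mu_1)^2 = \sum_{i=1}^n (x_i - \mu_1)^2$. This ratio increases or decreases with $\sum_{i=1}^n (x_i - \mu_1)^2$
according as $\sigma_1^2 >$ or $< \sigma_0^2$. Hence, using the weight function, an MP test for $H_w$ against $H_A$
has the critical region

$$w_1 = \left\{x : \sum_{i=1}^n (x_i - \mu_1)^2 > k_1\right\}$$

or

$$w_2 = \left\{x : \sum_{i=1}^n (x_i - \mu_1)^2 < k_2\right\}$$

according as $\sigma_1^2 >$ or $< \sigma_0^2$, where $k_1$ (respectively $k_2$) is such that $P_{H_w}(w_1) = P_{H_w}(w_2) = \alpha$.
Under $H_w$,

$$\frac{\sum_{i=1}^n (X_i - \mu_1)^2}{\sigma_0^2} \sim \chi^2_n,$$

and $k_1$ and $k_2$ can be evaluated from the percentile points of the $\chi^2_n$ distribution. Now
we have to verify that, under the above weight function,

$$P_\theta(w_k) \le \alpha \quad \forall \theta \in \Theta_H,\ k = 1, 2.$$

Here,

$$P_\theta(w_1) = P\left[\frac{\sum_{i=1}^n (X_i - \mu_1)^2}{\sigma^2} > \frac{k_1}{\sigma^2}\ \Big|\ (\mu, \sigma^2)\right] = P\left[Y > \frac{k_1}{\sigma^2}\ \Big|\ \lambda^2\right],$$

where $Y \sim$ non-central $\chi^2$ with d.f. $= n$ and n.c.p. $\lambda^2 = \frac{n(\mu - \mu_1)^2}{\sigma^2}$. It can be seen
that

$$P(Y > c) \uparrow \lambda^2 \quad \text{(for fixed } c \text{ and } n),$$

which implies

$$P[Y > c \mid \lambda^2 > 0] > P[Y > c \mid \lambda^2 = 0].$$

Hence $P_\theta(w_1) \le \alpha$ fails for some $\theta \in \Theta_H$ (for instance at $\sigma^2 = \sigma_0^2$, $\mu \ne \mu_1$),
and we conclude that the test $w_1$ is not MP size $\alpha$ for

$$H : [-\infty < \mu < \infty,\ \sigma^2 \le \sigma_0^2] \text{ vs. } H_A : [\mu = \mu_1,\ \sigma^2 = \sigma_1^2\ (> \sigma_0^2)].$$

Let us now consider

$$P_\theta(w_2) = P\left[Y < \frac{k_2}{\sigma^2}\ \Big|\ \lambda^2\right] \le P\left[Y < \frac{k_2}{\sigma^2}\ \Big|\ \lambda^2 = 0\right] \le P\left[Y < \frac{k_2}{\sigma_0^2}\ \Big|\ \lambda^2 = 0\right] = \alpha \quad \forall \sigma^2 \ge \sigma_0^2.$$

Thus, $w_2$ is MP for

$$H : [-\infty < \mu < \infty,\ \sigma^2 \ge \sigma_0^2] \text{ vs. } H_A : [\mu = \mu_1,\ \sigma^2 = \sigma_1^2\ (< \sigma_0^2)].$$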


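Alternatively, for the case $\sigma_1^2 > \sigma_0^2$, set

$$w^*(\theta) = w^*(\mu, \sigma^2) = \begin{cases} \dfrac{1}{\sqrt{2\pi}\,\delta}\exp\left\{-\dfrac{(\mu - \mu_1)^2}{2\delta^2}\right\}, & \text{if } -\infty < \mu < \infty \text{ and } \sigma^2 = \sigma_0^2 \\ 0, & \text{otherwise.} \end{cases}$$

That is, $\mu \sim N(\mu_1, \delta^2)$ and $\bar{X} \mid (\mu, \sigma_0^2) \sim N\left(\mu, \frac{\sigma_0^2}{n}\right)$, so that marginally

$$\bar{X} \sim N\left(\mu_1,\ \delta^2 + \frac{\sigma_0^2}{n}\right).$$

We choose $\delta^2$ such that $\delta^2 + \frac{\sigma_0^2}{n} = \frac{\sigma_1^2}{n}$, i.e., $\delta^2 = \frac{1}{n}(\sigma_1^2 - \sigma_0^2)$. Thus, under $H_{w^*}$,
$U \sim N\left(\mu_1, \frac{\sigma_1^2}{n}\right)$ independently of $\frac{V}{\sigma_0^2} \sim \chi^2_{n-1}$.

Here, we find the MP size $\alpha$ test for

$$H_{w^*} : \left[X \sim p_{w^*}(x) = c_1 \exp\left[-\frac{v}{2\sigma_0^2}\right]\exp\left\{-\frac{n}{2\sigma_1^2}(u - \mu_1)^2\right\}v^{\frac{n-3}{2}}\right] \text{ vs. } H_A : [\mu = \mu_1,\ \sigma^2 = \sigma_1^2 > \sigma_0^2].$$

The likelihood ratio for the above test is given by

$$\frac{p(x \mid H_A)}{p(x \mid H_{w^*})} = c \times \frac{\exp\left(-\frac{v}{2\sigma_1^2}\right)\exp\left(-\frac{n}{2\sigma_1^2}(u - \mu_1)^2\right)v^{\frac{n-3}{2}}}{\exp\left(-\frac{v}{2\sigma_0^2}\right)\exp\left(-\frac{n}{2\sigma_1^2}(u - \mu_1)^2\right)v^{\frac{n-3}{2}}} = c\,\exp\left[\frac{v}{2}\left(\frac{1}{\sigma_0^2} - \frac{1}{\sigma_1^2}\right)\right] \uparrow v.$$

Hence, the MP test for $(H_{w^*}, H_A)$ has the critical region

$$\omega_3 = \left\{x : \sum_{i=1}^n (x_i - \bar{x})^2 > k_3\right\},$$

where $k_3$ is such that $P_{H_{w^*}}(\omega_3) = \alpha \Rightarrow k_3 = \chi^2_{n-1,\alpha}\,\sigma_0^2$, with $\chi^2_{n-1,\alpha}$ the upper $\alpha$ point of $\chi^2_{n-1}$.
Also it can be shown that

$$P\{\omega_3 \mid \theta\} \le \alpha, \quad \forall \theta = (\mu, \sigma^2);\ -\infty < \mu < \infty,\ \sigma^2 \le \sigma_0^2.$$

Indeed,

$$P_\theta\left[\sum_{i=1}^n (X_i - \bar{X})^2 > k_3\right] = P\left[\chi^2_{n-1} > \frac{k_3}{\sigma^2}\right] \le P\left[\chi^2_{n-1} > \frac{k_3}{\sigma_0^2}\right] = \alpha \quad \forall\ -\infty < \mu < \infty,\ \sigma^2 \le \sigma_0^2.$$

The above test is MP size $\alpha$ for

$$H : [-\infty < \mu < \infty,\ \sigma^2 \le \sigma_0^2] \text{ vs. } H_A : [\mu = \mu_1,\ \sigma^2 = \sigma_1^2 > \sigma_0^2].$$

Since $\omega_3$ is independent of $H_A$, the test is also UMP size $\alpha$ for

$$H : [-\infty < \mu < \infty,\ \sigma^2 \le \sigma_0^2] \text{ vs. } H_A : [-\infty < \mu < \infty,\ \sigma^2 > \sigma_0^2].$$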

Statistical Inference I 
Monotone Likelihood Ratio 


In the present module we define the monotone likelihood ratio (MLR) property for a
family of p.m.f.s or p.d.f.s denoted by $\{p(x, \theta) : \theta \in \Theta \subseteq \mathbb{R}\}$. We exploit this property to
derive the UMP level $\alpha$ tests for one-sided null against one-sided alternative hypotheses
in some situations.

Monotone Likelihood Ratio (MLR) family of distributions

A real parametric family $\{p(x, \theta) : \theta \in \Theta \subseteq \mathbb{R}\}$ is said to have the MLR property in a
real-valued statistic $T(x)$ if, for any $\theta_1 < \theta_2 \in \Theta$, the following are satisfied.

(i) $p(x, \theta_1) \ne p(x, \theta_2)$
[Distributions are distinct corresponding to distinct parameter points]

(ii) The ratio $R(x) = \dfrac{p(x, \theta_2)}{p(x, \theta_1)}$
is non-decreasing in $T(x)$ on the set $\{x : \max(p(x, \theta_2), p(x, \theta_1)) > 0\}$.

Note. If $p(x, \theta_2) = 0$ and $p(x, \theta_1) > 0$, $R(x) = 0$.
If $p(x, \theta_2) > 0$ and $p(x, \theta_1) = 0$, $R(x) = \infty$.


Some examples on MLR families 


1. $\{p(x, \theta), \theta \in \Theta \subseteq \mathbb{R}\}$ : One-parameter exponential family. Then we can express
$p(x, \theta)$ in the form

$$p(x, \theta) = u(\theta)\exp(Q(\theta)T(x))\,v(x)$$

such that $u(\theta)$ and $Q(\theta)$ depend only on $\theta$, $v(x)$ is independent of $\theta$ and $T(x)$ depends
only on $x$. We set $T(x)$ such that $Q(\theta)$ is a strictly increasing function of $\theta$. Then we
have, for $\theta_1 < \theta_2$,

$$\frac{p(x, \theta_2)}{p(x, \theta_1)} = \frac{u(\theta_2)}{u(\theta_1)}\exp\{(Q(\theta_2) - Q(\theta_1))T(x)\},$$

which is increasing in $T(x)$ because $Q(\theta)$ is a strictly increasing function of $\theta$. Hence
$\{p(x, \theta), \theta \in \Theta\}$ has MLR in $T(x)$.

Note. If $(X_1, X_2, \ldots, X_n)$ is a random sample of size $n$ from the population with p.m.f.
or p.d.f. $p(x, \theta)$, then the joint distribution has MLR in $\sum_{i=1}^n T(x_i)$.

Consider the following examples. 

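1. Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from the $N(\theta, 1)$ population. Therefore,

$$p(x, \theta) = (2\pi)^{-n/2}\exp\left\{-\frac{1}{2}\sum_{i=1}^n (x_i - \theta)^2\right\} = (2\pi)^{-n/2}\exp\left\{-\frac{1}{2}\sum_{i=1}^n x_i^2\right\}\exp\left\{\theta\sum_{i=1}^n x_i\right\}\exp\left\{-\frac{n}{2}\theta^2\right\}$$

$$= u(\theta)\exp(Q(\theta)T(x))\,v(x),$$

where $u(\theta) = \exp\left\{-\frac{n}{2}\theta^2\right\}$, $Q(\theta) = \theta$, $T(x) = \sum_{i=1}^n x_i$,
and $v(x) = (2\pi)^{-n/2}\exp\left[-\frac{1}{2}\sum_{i=1}^n x_i^2\right]$.
$p(x, \theta)$ has MLR in $T(x) = \sum_{i=1}^n x_i$.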


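2. Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample of size $n$ from the $N(0, \theta^2)$ population, $\theta > 0$.
Therefore,

$$p(x, \theta) = (2\pi)^{-n/2}\theta^{-n}\exp\left(-\frac{1}{2\theta^2}\sum_{i=1}^n x_i^2\right) = u(\theta)\exp(Q(\theta)T(x))\,v(x),$$

where $u(\theta) = \theta^{-n}$, $Q(\theta) = -\frac{1}{2\theta^2}$, $T(x) = \sum_{i=1}^n x_i^2$ and $v(x) = (2\pi)^{-n/2}$.
$p(x, \theta)$ has MLR in $T(x) = \sum_{i=1}^n x_i^2$.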


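3. Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample of size $n$ from the Bernoulli$(\theta)$
population. Then

$$p(x, \theta) = \theta^{\sum x_i}(1 - \theta)^{n - \sum x_i} = (1 - \theta)^n\exp\left[\ln\left(\frac{\theta}{1 - \theta}\right)\sum_{i=1}^n x_i\right] = u(\theta)\exp(Q(\theta)T(x))\,v(x),$$

where $u(\theta) = (1 - \theta)^n$, $Q(\theta) = \ln\left(\frac{\theta}{1 - \theta}\right)$, $T(x) = \sum_{i=1}^n x_i$ and $v(x) = 1$.
$p(x, \theta)$ has MLR in $T(x) = \sum_{i=1}^n x_i$.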


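4. Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample of size $n$ from the geometric
distribution with p.m.f.

$$p(x, \theta) = \theta(1 - \theta)^x, \quad x = 0, 1, 2, \ldots,\ 0 < \theta < 1.$$

Then

$$p(x, \theta) = \theta^n(1 - \theta)^{\sum x_i} = \theta^n\exp\left[\ln(1 - \theta)\sum_{i=1}^n x_i\right] = u(\theta)\exp(Q(\theta)T(x))\,v(x),$$

where $u(\theta) = \theta^n$, $Q(\theta) = -\ln(1 - \theta)$, $T(x) = -\sum_{i=1}^n x_i$ and $v(x) = 1$. Here $Q(\theta)$ is
strictly increasing in $\theta$, so $p(x, \theta)$ has MLR in $T(x) = -\sum_{i=1}^n x_i$.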


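5. Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample of size $n$ from the exponential
distribution with p.d.f.

$$p(x, \theta) = \theta\exp[-\theta x], \quad x > 0,\ \theta > 0.$$

Now

$$p(x, \theta) = \theta^n\exp\left[-\theta\sum_{i=1}^n x_i\right] = u(\theta)\exp(Q(\theta)T(x))\,v(x),$$

where $u(\theta) = \theta^n$, $Q(\theta) = \theta$, $T(x) = -\sum_{i=1}^n x_i$ and $v(x) = 1$.
$p(x, \theta)$ has MLR in $T(x) = -\sum_{i=1}^n x_i$.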


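6. Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample of size $n$ from the exponential
distribution with p.d.f.

$$p(x, \theta) = \frac{1}{\theta}\exp\left[-\frac{x}{\theta}\right], \quad x > 0,\ \theta > 0.$$

Now

$$p(x, \theta) = \left(\frac{1}{\theta}\right)^n\exp\left[-\frac{1}{\theta}\sum_{i=1}^n x_i\right],$$

where $u(\theta) = \left(\frac{1}{\theta}\right)^n$, $Q(\theta) = -\frac{1}{\theta}$, $T(x) = \sum_{i=1}^n x_i$ and $v(x) = 1$.
$p(x, \theta)$ has MLR in $T(x) = \sum_{i=1}^n x_i$.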


Non-exponential family 


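1. $X \sim$ Cauchy$(\theta, 1)$.

For any $\theta_2 > \theta_1$,

$$\frac{p(x, \theta_2)}{p(x, \theta_1)} = \frac{1 + (x - \theta_1)^2}{1 + (x - \theta_2)^2}\ \begin{cases} < 1, & \text{at } x = \theta_1 \\ > 1, & \text{at } x = \theta_2 \\ \to 1, & \text{as } x \to \pm\infty, \end{cases}$$

so the ratio is not monotone in $x$.
Thus Cauchy$(\theta, 1)$ is not a member of the MLR family.

2. $X \sim$ Cauchy$(0, \theta)$, with

$$p(x, \theta) = \frac{1}{\pi}\,\frac{\theta}{\theta^2 + x^2}.$$

For any $\theta_1 < \theta_2$,

$$\frac{p(x, \theta_2)}{p(x, \theta_1)} = \frac{\theta_2}{\theta_1}\cdot\frac{\theta_1^2 + x^2}{\theta_2^2 + x^2} = \frac{\theta_2}{\theta_1}\left(1 - \frac{\theta_2^2 - \theta_1^2}{\theta_2^2 + x^2}\right),$$

which is increasing in $x^2$ or in $|x|$. Thus Cauchy$(0, \theta)$ is a member of the MLR family in $|x|$.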


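3. $X_i \sim R(0, \theta)$ for $i = 1, \ldots, n$ independently, and let $X = (X_1, \ldots, X_n)$. Then

$$p(x, \theta) = \begin{cases} \theta^{-n}, & \text{if } 0 < x_{(n)} \le \theta \\ 0, & \text{otherwise,} \end{cases}$$

where $x_{(n)}$ is the largest order statistic. Take $\theta_1 < \theta_2$. Thus,

$$\frac{p(x, \theta_2)}{p(x, \theta_1)} = \begin{cases} \left(\dfrac{\theta_1}{\theta_2}\right)^n, & \text{if } 0 < x_{(n)} \le \theta_1 \\ \infty, & \text{if } \theta_1 < x_{(n)} \le \theta_2 \end{cases}$$

is non-decreasing in $x_{(n)}$. Thus $R(0, \theta)$ is a member of the MLR family in $x_{(n)}$.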


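4. $X \sim$ Hypergeometric$(\theta, n, N)$, $\theta$ unknown. We know that

$$p(x, \theta) = \frac{\binom{\theta}{x}\binom{N - \theta}{n - x}}{\binom{N}{n}}, \quad \max\{n - N + \theta, 0\} \le x \le \min\{\theta, n\},$$

and hence

$$\frac{p(x, \theta + 1)}{p(x, \theta)} = \begin{cases} \dfrac{\theta + 1}{\theta + 1 - x}\cdot\dfrac{N - n - \theta + x}{N - \theta}, & \text{if } \theta + n - N + 1 \le x \le \theta \\ 0, & \text{if } x = \theta + n - N \\ \infty, & \text{if } x = \theta + 1, \end{cases}$$

which is non-decreasing in $x$. We can use the above to provide the following general
version. Take $\theta_1 < \theta_2$:

$$\frac{p(x, \theta_2)}{p(x, \theta_1)} = \frac{p(x, \theta_1 + 1)}{p(x, \theta_1)} \times \frac{p(x, \theta_1 + 2)}{p(x, \theta_1 + 1)} \times \cdots \times \frac{p(x, \theta_2)}{p(x, \theta_2 - 1)}.$$

Hence the family of Hypergeometric distributions has MLR in $T(x) = x$.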
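
The contrast between the Cauchy location family and the hypergeometric family can be checked numerically. The sketch below evaluates the likelihood ratio on a few points and tests whether the resulting sequence is non-decreasing. The helper names and the particular parameter values ($\theta_1 = 0$, $\theta_2 = 1$ for the Cauchy case; $\theta_1 = 5$, $\theta_2 = 9$, $n = 8$, $N = 20$ for the hypergeometric case) are arbitrary choices made for illustration.

```python
import math


def cauchy_location_ratio(x, theta1, theta2):
    """p(x, theta2) / p(x, theta1) for the Cauchy(theta, 1) family."""
    return (1.0 + (x - theta1) ** 2) / (1.0 + (x - theta2) ** 2)


def hypergeom_pmf(x, theta, n, N):
    """P(X = x) when n items are drawn from N, of which theta are marked."""
    if x < max(n - N + theta, 0) or x > min(theta, n):
        return 0.0
    return math.comb(theta, x) * math.comb(N - theta, n - x) / math.comb(N, n)


def hypergeom_ratios(theta1, theta2, n, N):
    """Likelihood ratios p(x, theta2)/p(x, theta1) on the set where max(p1, p2) > 0."""
    ratios = []
    for x in range(n + 1):
        p1 = hypergeom_pmf(x, theta1, n, N)
        p2 = hypergeom_pmf(x, theta2, n, N)
        if max(p1, p2) > 0:
            ratios.append(math.inf if p1 == 0.0 else p2 / p1)
    return ratios


def is_non_decreasing(values):
    return all(a <= b for a, b in zip(values, values[1:]))


cauchy_values = [cauchy_location_ratio(x, 0.0, 1.0) for x in (-50.0, 0.0, 1.0, 50.0)]
print(is_non_decreasing(cauchy_values))               # Cauchy location family
print(is_non_decreasing(hypergeom_ratios(5, 9, 8, 20)))  # hypergeometric family
```

A finite grid can only refute monotonicity, never prove it, so the second check is a sanity test of the algebra above rather than a substitute for it.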




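The following useful result is due to Karlin and Rubin (1956). We state the
result and sketch its proof.
Karlin-Rubin Theorem
For testing

$$H : \theta = \theta_0 \text{ vs. } H_A : \theta > \theta_0$$

corresponding to an MLR family $\{p(x, \theta), \theta \in \Theta\}$ in $T(x)$, the test given by

$$\phi(x) = \begin{cases} 1, & \text{if } T(x) > c \\ a, & \text{if } T(x) = c \\ 0, & \text{if } T(x) < c, \end{cases}$$

with $c$ and $a \in [0, 1]$ such that

$$E_{\theta_0}\phi(X) = P_{\theta_0}[T(X) > c] + a\,P_{\theta_0}[T(X) = c] = \alpha \in [0, 1],$$

is UMP size $\alpha$ for $(H, H_A)$.
Proof. Take any $\theta_1 > \theta_0$. Then by the NP lemma, the MP size $\alpha$ test for testing $H : \theta = \theta_0$
against $H_A' : \theta = \theta_1$ is given by

$$\phi(x) = \begin{cases} 1, & \text{if } p(x, \theta_1) > k\,p(x, \theta_0) \\ a, & \text{if } p(x, \theta_1) = k\,p(x, \theta_0) \\ 0, & \text{if } p(x, \theta_1) < k\,p(x, \theta_0), \end{cases}$$

where $k > 0$ and $a \in [0, 1]$ are such that

$$E_{\theta_0}\phi(X) = \alpha. \tag{1}$$

Let us write

$$R(x) = \frac{p(x, \theta_1)}{p(x, \theta_0)} = g(T(x)),$$

where $g(T(x))$ is a non-decreasing function of $T(x)$. Let us set

$$c = \inf\{T(x) : g(T(x)) > k\};$$

then the above test function can be written as

$$\phi(x) = \begin{cases} 1, & \text{if } T(x) > c \\ a, & \text{if } T(x) = c \\ 0, & \text{if } T(x) < c, \end{cases}$$

and it satisfies (1), and is equivalent to the desired test. Such a test, being independent
of $\theta_1\ (> \theta_0)$, is UMP size $\alpha$ for

$$H : \theta = \theta_0 \text{ vs. } H_A : \theta > \theta_0.$$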
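
For a discrete statistic the constants $c$ and $a$ in the Karlin-Rubin test generally require randomization. The sketch below, assuming a Bernoulli$(\theta)$ sample so that $T = \sum X_i$ is Binomial$(n, \theta)$, finds the pair $(c, a)$ with $P_{\theta_0}[T > c] + a\,P_{\theta_0}[T = c] = \alpha$ by accumulating the upper tail. The function names are illustrative assumptions.

```python
import math


def binom_pmf(t, n, theta):
    return math.comb(n, t) * theta ** t * (1.0 - theta) ** (n - t)


def karlin_rubin_binomial(n, theta0, alpha):
    """Return (c, a) such that P(T > c) + a * P(T = c) = alpha under theta0."""
    tail = 0.0  # running value of P(T > c)
    for c in range(n, -1, -1):
        p_c = binom_pmf(c, n, theta0)
        if tail + p_c > alpha:
            return c, (alpha - tail) / p_c
        tail += p_c
    return -1, 0.0  # only reached when alpha >= 1: reject for every sample


def power(theta, n, c, a):
    """E_theta phi(X) = P_theta(T > c) + a * P_theta(T = c)."""
    upper = sum(binom_pmf(t, n, theta) for t in range(c + 1, n + 1))
    at_c = binom_pmf(c, n, theta) if 0 <= c <= n else 0.0
    return upper + a * at_c


c, a = karlin_rubin_binomial(10, 0.5, 0.05)
print(c, a, power(0.5, 10, c, a), power(0.7, 10, c, a))
```

The power at $\theta_0$ equals $\alpha$ by construction, and it is larger at any $\theta > \theta_0$.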


Statistical Inference I 
Monotone Likelihood Ratio 


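Result 1. The power function $\beta(\theta)$ of the test

$$\phi(x) = \begin{cases} 1, & \text{if } T(x) > c \\ a, & \text{if } T(x) = c \\ 0, & \text{if } T(x) < c \end{cases}$$

is strictly increasing in $\theta$ so long as $0 < \beta(\theta) < 1$.

Proof. We have seen that the test $\phi(x)$ is UMP size $\alpha$ for testing
$H : \theta = \theta_0$ vs. $H_A : \theta > \theta_0$.
We have to show that $\beta(\theta_1) < \beta(\theta_2)$ for $\theta_1 < \theta_2$, provided $0 < \beta(\theta_1) < 1$. Note that $\phi(x)$ is MP of
size $\beta(\theta_1)$ for testing $H^* : \theta = \theta_1$ against $H_A^* : \theta = \theta_2$. Since $0 < \beta(\theta_1) < 1$,
the test $\phi(x)$ is strictly unbiased, i.e. $\beta(\theta_2) > \beta(\theta_1)$.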

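Corollary 1. The test given by

$$\phi(x) = \begin{cases} 1, & \text{if } T(x) > c \\ a, & \text{if } T(x) = c \\ 0, & \text{if } T(x) < c \end{cases}$$

is also UMP for
$H : \theta \le \theta_0$ vs. $H_A : \theta = \theta_1\ (> \theta_0)$.

Proof. Since $\beta(\theta)$ is strictly increasing,

$$\beta(\theta) \le \beta(\theta_0) = \alpha \text{ for every } \theta \le \theta_0.$$

Hence the test $\phi(x)$ is also UMP for
$H : \theta \le \theta_0$ vs. $H_A : \theta > \theta_0$.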


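Result 2. A test $\phi^*(x)$ given by

$$\phi^*(x) = \begin{cases} 1, & \text{if } T(x) < c \\ a, & \text{if } T(x) = c \\ 0, & \text{if } T(x) > c, \end{cases}$$

with $c$ and $a \in [0, 1]$ such that

$$E_{\theta_0}\phi^*(X) = P_{\theta_0}[T(X) < c] + a\,P_{\theta_0}[T(X) = c] = \alpha \in [0, 1],$$

is UMP size $\alpha$ for
$H : \theta \ge \theta_0$ vs. $H_A : \theta < \theta_0$.

Proof. For any $\theta_1 > \theta_0$ the ratio

$$R(x) = \frac{p(x, \theta_1)}{p(x, \theta_0)}$$

is a non-decreasing function of $T(x)$, i.e. a non-increasing function of $-T(x)$. If we
reparametrize by $-\theta$, that is, replace $\theta_1$ by $-\theta_1$ and $\theta_0$ by $-\theta_0$, then the likelihood ratio
of the larger to the smaller parameter value is a non-decreasing function of $-T(x)$.
Hence the test $\phi^*(x)$ given by

$$\phi^*(x) = \begin{cases} 1, & \text{if } -T(x) > -c \\ a, & \text{if } -T(x) = -c \\ 0, & \text{if } -T(x) < -c \end{cases}$$

is UMP size $\alpha$ for
$H : -\theta \le -\theta_0$ vs. $H_A : -\theta > -\theta_0$.

Hence the result follows.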


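Result 3. The test

$$\phi(x) = \begin{cases} 1, & \text{if } T(x) > c \\ a, & \text{if } T(x) = c \\ 0, & \text{if } T(x) < c \end{cases}$$

is uniformly least powerful (ULP) size $\alpha$ for testing $H : \theta = \theta_0$ vs. $H_A : \theta < \theta_0$.

Proof. Consider the test $\phi^0(x) = 1 - \phi(x)$,

$$\phi^0(x) = \begin{cases} 0, & \text{if } T(x) > c \\ 1 - a, & \text{if } T(x) = c \\ 1, & \text{if } T(x) < c, \end{cases}$$

with $E_{\theta_0}(\phi^0(X)) = 1 - \alpha$. Here $\phi^0(x)$ is a left-tailed test of size $1 - \alpha$. Such a
test is UMP of size $1 - \alpha$ for testing $H : \theta = \theta_0$ vs. $H_A : \theta < \theta_0$ among all tests
$\tilde\phi(x)$ with $E_{\theta_0}(\tilde\phi(X)) \le 1 - \alpha$. That is, $\phi^0(x)$ satisfies

$$E_\theta\phi^0(X) \ge E_\theta\tilde\phi(X)$$

for all $\theta < \theta_0$ and for all $\tilde\phi$ with $E_{\theta_0}\tilde\phi(X) \le 1 - \alpha$. That is,

$$E_\theta[1 - \phi(X)] \ge E_\theta\tilde\phi(X)$$

for all $\theta < \theta_0$ and for all $\tilde\phi$ with $E_{\theta_0}\tilde\phi(X) = 1 - \alpha$. Writing $\tilde\phi = 1 - \phi^*$, this says

$$E_\theta[1 - \phi(X)] \ge E_\theta[1 - \phi^*(X)]$$

for all $\theta < \theta_0$ and for all $\phi^*$ with $E_{\theta_0}\phi^*(X) = \alpha$. Hence we get $E_\theta\phi(X) \le E_\theta\phi^*(X)$ for all
$\theta < \theta_0$ and all $\phi^*$ with $E_{\theta_0}\phi^*(X) = \alpha$. That means $\phi(x)$, given by $\phi(x) = 1$ if $T(x) > c$,
is ULP size $\alpha$ within $\{\phi^*(x) : E_{\theta_0}\phi^*(X) = \alpha\}$.

Now we consider some applications of the Karlin-Rubin Theorem. First we consider
some examples where the distribution belongs to the one-parameter exponential family
and later we consider distributions which do not belong to the exponential
family.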

1. Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from the $N(\theta, 1)$ population. Since
$\{p(x, \theta), \theta \in \Theta\}$ has MLR in $T(x) = \sum_{i=1}^n X_i$, a right-tailed test based on $\sum_{i=1}^n X_i$ is
UMP for testing $H : \theta = \theta_0$ vs. $H_A : \theta > \theta_0$.

2. Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from the $N(0, \theta^2)$ population. Since
$\{p(x, \theta), \theta \in \Theta\}$ has MLR in $T(x) = \sum_{i=1}^n X_i^2$, a right-tailed test based on $\sum_{i=1}^n X_i^2$ is
UMP for testing $H : \theta = \theta_0$ vs. $H_A : \theta > \theta_0$.

3. Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from the Bernoulli$(\theta)$ population.
Since $\{p(x, \theta), \theta \in \Theta\}$ has MLR in $T(x) = \sum_{i=1}^n X_i$, a right-tailed test based on
$\sum_{i=1}^n X_i$ is UMP for testing $H : \theta = \theta_0$ vs. $H_A : \theta > \theta_0$.

4. Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from the Geometric$(\theta)$ population.
Since $\{p(x, \theta), \theta \in \Theta\}$ has MLR in $T(x) = -\sum_{i=1}^n X_i$, a left-tailed test based on
$\sum_{i=1}^n X_i$ is UMP for testing $H : \theta = \theta_0$ vs. $H_A : \theta > \theta_0$.

5. Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from the exponential population with
mean $1/\theta$. Since $\{p(x, \theta), \theta \in \Theta\}$ has MLR in $T(x) = -\sum_{i=1}^n X_i$, a left-tailed test
based on $\sum_{i=1}^n X_i$ is UMP for testing $H : \theta = \theta_0$ vs. $H_A : \theta > \theta_0$.

6. Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from the exponential population with
mean $\theta$. As we know that $\{p(x, \theta), \theta \in \Theta\}$ has MLR in $T(x) = \sum_{i=1}^n X_i$, a right-tailed
test based on $\sum_{i=1}^n X_i$ is UMP for testing $H : \theta = \theta_0$ vs. $H_A : \theta > \theta_0$.

7. Let $X$ be an observation from the Cauchy$(\theta, 1)$ population. We have seen that
Cauchy$(\theta, 1)$ is not a member of the MLR family, so the Karlin-Rubin theorem does not
apply, and there does not exist any UMP test for testing $H : \theta = \theta_0$ vs. $H_A : \theta > \theta_0$.

8. Let $X$ be an observation from the Cauchy$(0, \theta)$ population. Since $\{p(x, \theta), \theta \in \Theta\}$ has
MLR in $T(x) = |X|$, a right-tailed test based on $|X|$ is UMP for testing $H : \theta = \theta_0$
vs. $H_A : \theta > \theta_0$.

9. Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from the $R(0, \theta)$ population. As we
know that $\{p(x, \theta), \theta \in \Theta\}$ has MLR in $T(x) = X_{(n)}$, a right-tailed test based on $X_{(n)}$
is UMP for testing $H : \theta = \theta_0$ vs. $H_A : \theta > \theta_0$.


We end our discussion with the following interesting problem.

Problem: For the $N(\theta, 1)$ population there does not exist a UMP size $\alpha$ test for testing
$H : \theta = \theta_0$ vs. $H_A : \theta \ne \theta_0$.

Solution: Let $X = (X_1, X_2, \ldots, X_n)$ be a random sample from the $N(\theta, 1)$ population.
If possible, suppose there exists a UMP size $\alpha$ test for $H : \theta = \theta_0$ vs. $H_A : \theta \ne \theta_0$, and
let it be $\phi(X)$. Then the test $\phi(X)$ is UMP size $\alpha$ for testing $H : \theta = \theta_0$ vs.
$H_A : \theta > \theta_0$. Again, the test $\phi(X)$ is UMP size $\alpha$ for testing $H : \theta = \theta_0$ vs.
$H_A : \theta < \theta_0$. This contradicts the fact that a test which is UMP for the right-sided
alternative is uniformly least powerful for left-sided alternatives. Hence
there does not exist a UMP size $\alpha$ test for testing $H : \theta = \theta_0$ vs. $H_A : \theta \ne \theta_0$.
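
The conflict behind this argument is easy to see numerically. For a $N(\theta, 1)$ sample, the right-tailed test rejects when $\sqrt{n}(\bar{X} - \theta_0) > \tau_\alpha$ and the left-tailed test when $\sqrt{n}(\bar{X} - \theta_0) < -\tau_\alpha$. The sketch below computes both power functions with the standard library's `statistics.NormalDist`. The function names are illustrative assumptions.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_right_tailed(theta, theta0, alpha, n):
    """P_theta( sqrt(n) (Xbar - theta0) > tau_alpha )."""
    tau = STD_NORMAL.inv_cdf(1.0 - alpha)
    return 1.0 - STD_NORMAL.cdf(tau - math.sqrt(n) * (theta - theta0))


def power_left_tailed(theta, theta0, alpha, n):
    """P_theta( sqrt(n) (Xbar - theta0) < -tau_alpha )."""
    tau = STD_NORMAL.inv_cdf(1.0 - alpha)
    return STD_NORMAL.cdf(-tau - math.sqrt(n) * (theta - theta0))


for theta in (-0.5, 0.0, 0.5):
    print(theta,
          power_right_tailed(theta, 0.0, 0.05, 25),
          power_left_tailed(theta, 0.0, 0.05, 25))
```

Both tests have power $\alpha$ at $\theta_0$, but on each side of $\theta_0$ one of them has power above $\alpha$ and the other below it, so no single test can dominate on both sides.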


Statistical Inference I 
Generalized Neyman-Pearson lemma-Theory of UMPU tests 


In Modules 23 and 24 we have discussed the Neyman-Pearson (NP) lemma for obtaining
most powerful tests. In the present module we shall consider a generalization of the
NP lemma. In this generalization we consider the same optimization problem considered
in the NP lemma, but increase the number of side conditions from one to many.
As in the case of the NP lemma, we first state and prove the generalized NP (GNP)
lemma for non-randomized tests and then consider the case of randomized tests.
In the non-randomized case, the proof of the GNP lemma is very similar to the proof
of the NP lemma.


GNP Lemma (Non-Randomized case) 


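Let $g_0, g_1, \ldots, g_m$ be $(m + 1)$ integrable real-valued functions defined on $\mathcal{X}$. Suppose
there exists at least one test function $\phi(x)$ such that

$$\int_{\mathcal{X}} \phi(x)g_i(x)\,dx = c_i, \quad i = 1, 2, \ldots, m, \tag{1}$$

where $c_1, \ldots, c_m$ are some known numbers. Let $\phi_0(x)$ be another test function satisfying
(1), such that

$$\phi_0(x) = \begin{cases} 1, & \text{if } g_0(x) > \sum_{i=1}^m k_i g_i(x) \\ 0, & \text{if } g_0(x) < \sum_{i=1}^m k_i g_i(x), \end{cases} \tag{2}$$

where $k_1, \ldots, k_m$ are some constants determined appropriately. Then we have

$$\int_{\mathcal{X}} \phi_0(x)g_0(x)\,dx \ge \int_{\mathcal{X}} \phi(x)g_0(x)\,dx \tag{3}$$

for all tests $\phi(x)$ satisfying (1).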
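Particular case
Suppose $m = 1$, $g_0(x) = p(x, \theta_1)$, $g_1(x) = p(x, \theta_0)$, $k_1 = k$, $c_1 = \alpha$. Then (1) is
equivalent to the size condition, i.e.

$$\int_{\mathcal{X}} \phi(x)p(x, \theta_0)\,dx = \alpha.$$

The test function $\phi_0(x)$ satisfies the above equation, and from (2) we see that

$$\phi_0(x) = \begin{cases} 1, & \text{if } p(x, \theta_1) > k\,p(x, \theta_0) \\ 0, & \text{if } p(x, \theta_1) < k\,p(x, \theta_0). \end{cases}$$

From equation (3) we get

$$\int_{\mathcal{X}} \phi_0(x)p(x, \theta_1)\,dx \ge \int_{\mathcal{X}} \phi(x)p(x, \theta_1)\,dx.$$

Hence $\phi_0(x)$ is an MP test of size $\alpha$.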
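Proof

Let us define a function

$$Q(x) = (\phi_0(x) - \phi(x))\left(g_0(x) - \sum_{i=1}^m k_i g_i(x)\right).$$

Then, as in the proof of the Neyman-Pearson lemma, it can be shown that
$Q(x) \ge 0$ for all $x \in \mathcal{X}$.
It follows that

$$0 \le \int_{\mathcal{X}} Q(x)\,dx$$

$$\Rightarrow \int_{\mathcal{X}} \phi_0(x)g_0(x)\,dx - \int_{\mathcal{X}} \phi(x)g_0(x)\,dx \ge \int_{\mathcal{X}} \phi_0(x)\sum_{i=1}^m k_i g_i(x)\,dx - \int_{\mathcal{X}} \phi(x)\sum_{i=1}^m k_i g_i(x)\,dx = \sum_{i=1}^m k_i(c_i - c_i) = 0,$$

since both $\phi_0$ and $\phi$ satisfy (1). Hence the proof.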


GNP Lemma (Randomized case) 


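Let $g_0, g_1, \ldots, g_m$ be $(m + 1)$ integrable real-valued functions defined on $\mathcal{X}$. Suppose
there exists at least one test function $\phi(x)$ such that

$$\int_{\mathcal{X}} \phi(x)g_i(x)\,dx = c_i, \quad i = 1, 2, \ldots, m, \tag{4}$$

where $c_1, \ldots, c_m$ are some known numbers. Let $\phi_0(x)$ be another test function satisfying
(4), such that

$$\phi_0(x) = 1 \quad \text{if } g_0(x) > \sum_{i=1}^m k_i g_i(x) \tag{5}$$
$$\phantom{\phi_0(x)} = \gamma(x) \quad \text{if } g_0(x) = \sum_{i=1}^m k_i g_i(x) \tag{6}$$
$$\phantom{\phi_0(x)} = 0 \quad \text{if } g_0(x) < \sum_{i=1}^m k_i g_i(x) \tag{7}$$

where $0 \le \gamma(x) \le 1$ and $k_1, \ldots, k_m$ are some constants determined appropriately.
Then we have

$$\int_{\mathcal{X}} \phi_0(x)g_0(x)\,dx \ge \int_{\mathcal{X}} \phi(x)g_0(x)\,dx$$

for all tests $\phi(x)$ satisfying (4).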


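Unbiased test: A test $\phi$ is called an unbiased test at level $\alpha$ for testing

$$H : \theta \in \Theta_H \text{ vs. } H_A : \theta \in \Theta_{H_A} \subset (\Theta \cap \Theta_H^c),$$

if

$$E_\theta\phi \le \alpha\ \ \forall \theta \in \Theta_H \quad \text{and} \quad E_\theta\phi \ge \alpha\ \ \forall \theta \in \Theta_{H_A}. \tag{8}$$

In Module 32 we have already seen that a UMP test does not exist for testing $H : \theta = \theta_0$
against $H_A : \theta \ne \theta_0$ where the sample is taken from a $N(\theta, 1)$ population. In such a
situation, we may impose the restriction of unbiasedness and look for a UMP test
in the class of all unbiased tests.
Uniformly most powerful unbiased test (UMPU): A test $\phi^0$ is called a UMPU
test at level $\alpha$ for testing

$$H : \theta \in \Theta_H \text{ vs. } H_A : \theta \in \Theta_{H_A} \subset (\Theta \cap \Theta_H^c),$$

if
(i) $E_\theta\phi^0 \le \alpha\ \ \forall \theta \in \Theta_H$ (Size condition)

(ii) $E_\theta\phi^0 \ge \alpha\ \ \forall \theta \in \Theta_{H_A}$ (Unbiasedness condition)

(iii) $E_\theta\phi^0 \ge E_\theta\phi\ \ \forall \theta \in \Theta_{H_A}$ (Power condition),

where $\phi$ is any test function satisfying (i) and (ii).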


Note. If $\beta_\phi(\theta) = E_\theta\phi$ is a continuous function of $\theta$, then (8) implies

$$\beta_\phi(\theta) = \alpha \text{ for all } \theta \in \Theta_B, \tag{9}$$

where $\Theta_B$ is the common boundary of $\Theta_H$ and $\Theta_{H_A}$, that is, the set of points $\theta$ that
are limit points of both $\Theta_H$ and $\Theta_{H_A}$. Tests satisfying (9) are said to be $\alpha$-similar on
the boundary.

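Example. Let $X_1, X_2, \ldots, X_n \sim N(\theta, 1)$, where $\theta$ is unknown. To test

$$H : \theta \le 0 \text{ vs. } H_A : \theta > 0,$$

consider the test

$$\phi(x) = \begin{cases} 1, & \text{if } \sqrt{n}\,\bar{x} > \tau_\alpha \\ 0, & \text{if } \sqrt{n}\,\bar{x} \le \tau_\alpha, \end{cases}$$

where $\tau_\alpha$ is the upper $100\alpha\%$ point of the $N(0, 1)$ distribution.
Here,

$$\Theta_H = \{\theta : \theta \le 0\}, \quad \Theta_{H_A} = \{\theta : \theta > 0\},$$

and

$$\Theta_B = \{\theta : \theta = 0\}.$$

Note that on $\Theta_B$,

$$\sqrt{n}\,\bar{X} \sim N(0, 1).$$

Therefore,

$$E_\theta\phi = P\left[\sqrt{n}\,\bar{X} > \tau_\alpha\right] = \alpha, \quad \forall \theta \in \Theta_B.$$

Hence, $\phi$ is $\alpha$-similar on $\Theta_B$.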


Since it is more convenient to work with (9) than with (8), the following result plays
an important role in the determination of UMPU tests. The proof of the result will be
discussed later.

Result 1. If the power function of a test $\phi$ is continuous, then unbiasedness of a level $\alpha$
test $\phi$ implies its $\alpha$-similarity.


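Tests for One Parameter Exponential family of distributions. We have already
seen that in the case of the one-parameter exponential family of distributions one can get
a UMP test for (i) $H : \theta \le \theta_0$ against $H_A : \theta > \theta_0$ by applying the Karlin-Rubin theorem.
It can be shown (see Lehmann) that there exists a UMP test for (ii) $H : \theta \le \theta_1$ or $\theta \ge \theta_2$
$(\theta_1 < \theta_2)$ against $H_A : \theta_1 < \theta < \theta_2$ by using the GNP lemma. If $T(x)$ is sufficient for $\theta$,
then the UMP test of size $\alpha$ for testing problem (ii) is given by

$$\phi_0(x) = \begin{cases} 1, & \text{if } k_1 < T(x) < k_2 \\ \gamma_i, & \text{if } T(x) = k_i,\ i = 1, 2 \\ 0, & \text{if } T(x) < k_1 \text{ or } T(x) > k_2, \end{cases}$$

where the $k$'s and $\gamma$'s are determined by

$$E_{\theta_1}\phi_0(X) = E_{\theta_2}\phi_0(X) = \alpha.$$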


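But there does not exist a UMP test for (iii) $H : \theta_1 \le \theta \le \theta_2$ against $H_A : \theta < \theta_1$ or $\theta > \theta_2$
and for (iv) $H : \theta = \theta_0$ against $H_A : \theta \ne \theta_0$. For the testing problems (iii) and
(iv) we can find UMPU tests by using the GNP lemma. We shall now consider the
testing problem (iv). To find a UMPU test for (iv) let us define

$$U(\alpha, \theta_0) = \{\phi : E_{\theta_0}\phi \le \alpha \text{ and } E_\theta\phi \ge \alpha,\ \forall \theta \ne \theta_0\},$$

the class of unbiased tests at level $\alpha$ for testing the hypothesis under study. It can be
shown (see Lehmann) that the power function $E_\theta\phi$ of any test $\phi(X)$ for the
one-parameter exponential family is continuous and differentiable. So the unbiasedness
condition (8) implies that $E_\theta\phi$ has its minimum value at $\theta = \theta_0$, and hence $E_{\theta_0}\phi = \alpha$ and
$E'_{\theta_0}\phi = 0$, where

$$E'_{\theta_0}\phi = \frac{d}{d\theta}E_\theta\phi\,\Big|_{\theta = \theta_0}.$$

Now we define

$$D(\alpha, \theta_0) = \{\phi : E_{\theta_0}\phi = \alpha \text{ and } E'_{\theta_0}\phi = 0\}.$$

Result 2. Under the assumption that $E_\theta\phi$ is continuous and differentiable, $U(\alpha, \theta_0) \subseteq D(\alpha, \theta_0)$.
Proof:

$$\phi \in U(\alpha, \theta_0) \iff E_{\theta_0}\phi \le \alpha \text{ and } E_\theta\phi \ge \alpha,\ \forall \theta \ne \theta_0.$$

Therefore,

$$E_{\theta_0}\phi = \alpha \text{ and } E'_{\theta_0}\phi = 0.$$

Hence, $\phi \in D(\alpha, \theta_0)$.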


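Result 3. Suppose $\phi^0$ is the best test within $D(\alpha, \theta_0)$, that is, $\phi^0$ maximizes
$E_{\theta_1}\phi\ \ \forall \theta_1 \ne \theta_0$, subject to

$$E_{\theta_0}\phi = \alpha \text{ and } E'_{\theta_0}\phi = 0.$$

Then $\phi^0 \in U(\alpha, \theta_0)$.
Proof: Let

$$\phi^*(x) = \alpha, \quad \forall x.$$

Then

$$E_\theta\phi^* = \alpha\ \ \forall \theta, \quad E_{\theta_0}\phi^* = \alpha, \quad E'_{\theta_0}\phi^* = 0.$$

Therefore, $\phi^* \in D(\alpha, \theta_0)$. Hence $E_{\theta_1}\phi^0 \ge E_{\theta_1}\phi^* = \alpha,\ \forall \theta_1 \ne \theta_0$. Also $E_{\theta_0}\phi^0 = \alpha$, so
$\phi^0 \in U(\alpha, \theta_0)$. Since $U(\alpha, \theta_0) \subseteq D(\alpha, \theta_0)$, we have $E_{\theta_1}\phi^0 \ge E_{\theta_1}\phi$ for every $\phi \in U(\alpha, \theta_0)$
and every $\theta_1 \ne \theta_0$. Hence $\phi^0$ is UMPU at level $\alpha$ for testing

$$H : \theta = \theta_0 \text{ vs. } H_A : \theta \ne \theta_0.$$

Remark. From Result 3 we see that in the case of a one-parameter exponential family (OPEF)
a UMPU test for $H : \theta = \theta_0$ against $H_A : \theta \ne \theta_0$ can be obtained by determining a UMP test
within the class $D(\alpha, \theta_0)$.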
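
The two side conditions defining $D(\alpha, \theta_0)$ can be illustrated on a familiar case. The sketch below takes the equal-tailed test for a $N(\theta, 1)$ sample, which rejects when $|\sqrt{n}(\bar{X} - \theta_0)| > \tau_{\alpha/2}$, and checks numerically that its power function equals $\alpha$ at $\theta_0$ and has zero slope there. The choice of this particular test is an assumption made for illustration, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_two_sided(theta, theta0, alpha, n):
    """Power of the equal-tailed z-test of H: theta = theta0 for N(theta, 1) data."""
    tau = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = math.sqrt(n) * (theta - theta0)
    upper = 1.0 - STD_NORMAL.cdf(tau - shift)   # P(sqrt(n)(Xbar - theta0) > tau)
    lower = STD_NORMAL.cdf(-tau - shift)        # P(sqrt(n)(Xbar - theta0) < -tau)
    return upper + lower


theta0, alpha, n, h = 0.0, 0.05, 25, 1e-4
size = power_two_sided(theta0, theta0, alpha, n)
# central difference approximation to the derivative of the power at theta0
slope = (power_two_sided(theta0 + h, theta0, alpha, n)
         - power_two_sided(theta0 - h, theta0, alpha, n)) / (2.0 * h)
print(size, slope)
```

The power function is symmetric about $\theta_0$, which is why its derivative vanishes there, and it exceeds $\alpha$ on both sides of $\theta_0$.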


Statistical Inference I 
Locally most powerful tests 


So far we have treated the testing of one-sided and two-sided problems for a real
parameter when the distribution of the random observable $X$ is sufficiently well behaved,
that is, a one-parameter exponential family or a family with monotone likelihood
ratio. In this module we consider the problem of finding optimal tests for distributions
without a monotone likelihood ratio and which do not belong to the one-parameter
exponential family. Among the various other methods available in the literature,
we are going to discuss the locally best tests of Neyman and Pearson.


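One sided tests
Let $X$ be a random variable with p.m.f./p.d.f. $p_\theta(x)$, $\theta \in \Theta \subseteq \mathbb{R}$.
Consider the problem of testing

$$H : \theta = \theta_0 \text{ vs. } H_A : \theta > \theta_0.$$

Definition. A test $\phi^0$ is called a locally most powerful (LMP) test at size $\alpha$ for
testing
$H : \theta = \theta_0$ vs. $H_A : \theta > \theta_0$
if for some $\epsilon > 0$,

(i) $\beta_{\phi^0}(\theta_0) = \alpha$
(ii) $\beta_{\phi^0}(\theta) \ge \beta_\phi(\theta)$ for all $\theta \in (\theta_0, \theta_0 + \epsilon)$ and whatever $\phi$ satisfying (i),

where $\beta_\phi(\theta) = E_\theta\phi$ denotes the power function of the test $\phi$. By the mean value theorem,

$$\beta_\phi(\theta) = \beta_\phi(\theta_0) + (\theta - \theta_0)\beta'_\phi(\theta^*), \quad \theta_0 < \theta^* < \theta.$$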


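Assumptions

(i) $\beta_\phi(\theta)$ is continuously differentiable in the neighborhood of $\theta_0$ for every $\phi$.

(ii) $\beta'_\phi(\theta) = \displaystyle\int \phi(x)\frac{\partial p_\theta(x)}{\partial\theta}\,dx$.

Our problem is to maximize $\beta'_\phi(\theta_0)$ subject to $\beta_\phi(\theta_0) = \int \phi(x)p_{\theta_0}(x)\,dx = \alpha$. Using the
generalized NP lemma we get the optimum choice of $\phi$ as

$$\phi^0(x) = \begin{cases} 1, & \text{if } \dfrac{\partial p_\theta(x)}{\partial\theta}\Big|_{\theta_0} > k\,p_{\theta_0}(x) \\[2mm] \gamma(x), & \text{if } \dfrac{\partial p_\theta(x)}{\partial\theta}\Big|_{\theta_0} = k\,p_{\theta_0}(x) \\[2mm] 0, & \text{if } \dfrac{\partial p_\theta(x)}{\partial\theta}\Big|_{\theta_0} < k\,p_{\theta_0}(x), \end{cases}$$

or, equivalently,

$$\phi^0(x) = \begin{cases} 1, & \text{if } \dfrac{\partial\log p_\theta(x)}{\partial\theta}\Big|_{\theta_0} > k \\[2mm] \gamma(x), & \text{if } \dfrac{\partial\log p_\theta(x)}{\partial\theta}\Big|_{\theta_0} = k \\[2mm] 0, & \text{if } \dfrac{\partial\log p_\theta(x)}{\partial\theta}\Big|_{\theta_0} < k, \end{cases}$$

where $k$ and $\gamma$ are such that $E_{\theta_0}\phi^0 = \alpha$.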
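Example. Let $X_1, X_2, \ldots, X_n \sim$ Cauchy$(\theta, 1)$. The joint density is given by

$$p_\theta(x) = \prod_{i=1}^n \frac{1}{\pi[1 + (x_i - \theta)^2]}.$$

Consider the problem of testing
$H_0 : \theta = 0$ vs. $H_1 : \theta > 0$.

The LMP test is given by

$$\phi^0(x) = \begin{cases} 1, & \text{if } \dfrac{\partial\log p_\theta(x)}{\partial\theta}\Big|_{\theta = 0} > k \\[2mm] 0, & \text{if } \dfrac{\partial\log p_\theta(x)}{\partial\theta}\Big|_{\theta = 0} < k. \end{cases}$$

Here

$$\frac{\partial\log p_\theta(x)}{\partial\theta}\Big|_{\theta = 0} = \sum_{i=1}^n \frac{2x_i}{1 + x_i^2},$$

so the test rejects when $\sum_{i=1}^n \frac{2X_i}{1 + X_i^2} > c$. Under $H_0$,

$$E\left(\frac{2X}{1 + X^2}\right) = 0 \quad \text{and} \quad V\left(\frac{2X}{1 + X^2}\right) = \frac{1}{2},$$

since $X$ has a symmetric distribution about 0. So, by the CLT we have

$$\frac{1}{\sqrt{n}}\sum_{i=1}^n \frac{2X_i}{1 + X_i^2} \xrightarrow{d} N\left(0, \tfrac{1}{2}\right).$$

Hence for large enough $n$, we can choose $c = \tau_\alpha\sqrt{n/2}$.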


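Drawback. If $\alpha < 1/2$, then $c > 0$. If every $|x_i| > R$ with $R = 2n/c$, then each term satisfies
$\left|\frac{2x_i}{1 + x_i^2}\right| < \frac{2}{|x_i|} < \frac{c}{n}$ and the sum cannot exceed $c$. Hence

$$\left\{x : \sum_{i=1}^n \frac{2x_i}{1 + x_i^2} > c\right\} \subseteq \bigcup_{i=1}^n \{x : |x_i| \le R\}, \quad R < \infty.$$

Therefore, the power function satisfies

$$\beta_{\phi^0}(\theta) \le \sum_{i=1}^n P_\theta(|X_i| \le R) = n\int_{|x_1| \le R} \frac{1}{\pi(1 + (x_1 - \theta)^2)}\,dx_1 = n\int_{-R - \theta}^{R - \theta} \frac{1}{\pi(1 + z^2)}\,dz \to 0 \text{ as } \theta \to \infty.$$

Thus the power goes to 0 as $\theta \to \infty$. Hence, although $\phi^0(x)$ is good at detecting small
departures from the null hypothesis, it is unsuccessful in detecting sufficiently large ones.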
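
The loss of power for distant alternatives can be seen in a small simulation. The sketch below draws Cauchy$(\theta, 1)$ samples by the inverse-CDF method, applies the test that rejects when $\sum 2X_i/(1 + X_i^2) > \tau_\alpha\sqrt{n/2}$ (taking $\theta_0 = 0$ and the large-sample cut-off from the text), and estimates the rejection rate. The function names, the sample size, the seed and the number of replications are all arbitrary assumptions of this illustration.

```python
import math
import random
from statistics import NormalDist


def lmp_statistic(sample, theta0=0.0):
    """Score statistic: sum of 2(x - theta0) / (1 + (x - theta0)^2)."""
    return sum(2.0 * (x - theta0) / (1.0 + (x - theta0) ** 2) for x in sample)


def lmp_cauchy_test(sample, alpha, theta0=0.0):
    """Return 1 (reject) when the score statistic exceeds tau_alpha * sqrt(n / 2)."""
    n = len(sample)
    c = NormalDist().inv_cdf(1.0 - alpha) * math.sqrt(n / 2.0)
    return 1 if lmp_statistic(sample, theta0) > c else 0


def simulated_power(theta, n, alpha, reps=2000, seed=0):
    """Monte Carlo estimate of the rejection probability at location theta."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        # inverse-CDF sampling: theta + tan(pi (U - 1/2)) is Cauchy(theta, 1)
        sample = [theta + math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
        rejections += lmp_cauchy_test(sample, alpha)
    return rejections / reps


for theta in (0.0, 0.5, 2.0, 10.0, 50.0):
    print(theta, simulated_power(theta, n=20, alpha=0.05))
```

The estimated power first rises with $\theta$ and then falls back towards zero, because each term $2x/(1 + x^2)$ is bounded and tends to zero for large $|x|$.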


Two sided tests 


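The notion of a locally best test may be extended to testing the hypothesis
$H : \theta = \theta_0$ vs. $H_A : \theta \ne \theta_0$
by specifying the slope of the power function at $\theta_0$ and requiring the second derivative
of the power function at $\theta_0$ to be a maximum. Hence we consider only unbiased tests
with slope zero at $\theta_0$.

Assumptions

(i) $\beta_\phi(\theta)$ is twice continuously differentiable in the neighbourhood of $\theta_0$ for every $\phi$.

(ii) $\beta'_\phi(\theta) = \displaystyle\int \phi(x)\frac{\partial p_\theta(x)}{\partial\theta}\,dx$

(iii) $\beta''_\phi(\theta) = \displaystyle\int \phi(x)\frac{\partial^2 p_\theta(x)}{\partial\theta^2}\,dx$

By the mean value theorem,

$$\beta_\phi(\theta) = \beta_\phi(\theta_0) + (\theta - \theta_0)\beta'_\phi(\theta_0) + \frac{(\theta - \theta_0)^2}{2}\beta''_\phi(\theta^*), \quad \min(\theta_0, \theta) < \theta^* < \max(\theta_0, \theta).$$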


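Locally best unbiased tests

A test $\varphi^0$ is called a locally best unbiased test at size $\alpha$ for testing
$$H_0 : \theta = \theta_0 \quad\text{vs.}\quad H_1 : \theta \ne \theta_0$$
if, out of all $\varphi$ satisfying $\beta_\varphi(\theta_0) = \alpha$ and $\beta'_\varphi(\theta_0) = 0$, the test $\varphi^0$ maximizes the value of the second derivative at $\theta_0$, that is,
$$\beta''_{\varphi^0}(\theta_0) \ge \beta''_\varphi(\theta_0).$$

When such a test is unique, the optimum property of a locally best unbiased test may be stated as follows.

A test $\varphi^0$ is called an LMP test at size $\alpha$ for testing
$$H_0 : \theta = \theta_0 \quad\text{vs.}\quad H_1 : \theta \ne \theta_0$$
if for some $\epsilon > 0$,

(i) $\beta_{\varphi^0}(\theta_0) = \alpha$

(ii) $\beta_{\varphi^0}(\theta) \ge \beta_\varphi(\theta)$ for all $|\theta - \theta_0| < \epsilon$, $\theta \ne \theta_0$,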


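where $\beta_\varphi(\theta)$ is the power function of the test $\varphi$ and $\varphi$ satisfies (i). Our aim is to maximize
$$\int \varphi(x) \left.\frac{\partial^2 p_\theta(x)}{\partial\theta^2}\right|_{\theta=\theta_0} dx$$
subject to
$$\int \varphi(x)\,p_{\theta_0}(x)\,dx = \alpha$$
and
$$\int \varphi(x) \left.\frac{\partial p_\theta(x)}{\partial\theta}\right|_{\theta=\theta_0} dx = 0.$$
By the generalized NP Lemma the optimum choice is
$$\varphi^0(x) = \begin{cases} 1 & \text{if } \left.\dfrac{\partial^2 p_\theta(x)}{\partial\theta^2}\right|_{\theta=\theta_0} > k_1\,p_{\theta_0}(x) + k_2 \left.\dfrac{\partial p_\theta(x)}{\partial\theta}\right|_{\theta=\theta_0} \\[2mm] 0 & \text{if } \left.\dfrac{\partial^2 p_\theta(x)}{\partial\theta^2}\right|_{\theta=\theta_0} < k_1\,p_{\theta_0}(x) + k_2 \left.\dfrac{\partial p_\theta(x)}{\partial\theta}\right|_{\theta=\theta_0} \end{cases}$$
and $E_{\theta_0}\varphi^0 = \alpha$ and $\beta'_{\varphi^0}(\theta_0) = 0$. Note that
$$\frac{\partial \log p_\theta(x)}{\partial\theta} = \frac{1}{p_\theta(x)}\,\frac{\partial p_\theta(x)}{\partial\theta}$$
and
$$\frac{\partial^2 \log p_\theta(x)}{\partial\theta^2} = \frac{1}{p_\theta(x)}\,\frac{\partial^2 p_\theta(x)}{\partial\theta^2} - \left(\frac{\partial \log p_\theta(x)}{\partial\theta}\right)^2.$$

Therefore, $\varphi^0(x)$ can be rewritten as
$$\varphi^0(x) = \begin{cases} 1 & \text{if } \left.\dfrac{\partial^2 \log p_\theta(x)}{\partial\theta^2}\right|_{\theta=\theta_0} + \left(\left.\dfrac{\partial \log p_\theta(x)}{\partial\theta}\right|_{\theta=\theta_0}\right)^2 > k_1 + k_2 \left.\dfrac{\partial \log p_\theta(x)}{\partial\theta}\right|_{\theta=\theta_0} \\[2mm] 0 & \text{if } \left.\dfrac{\partial^2 \log p_\theta(x)}{\partial\theta^2}\right|_{\theta=\theta_0} + \left(\left.\dfrac{\partial \log p_\theta(x)}{\partial\theta}\right|_{\theta=\theta_0}\right)^2 < k_1 + k_2 \left.\dfrac{\partial \log p_\theta(x)}{\partial\theta}\right|_{\theta=\theta_0}. \end{cases}$$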
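Example. Let $X_1, \dots, X_n \sim C(\theta, 1)$. To test
$$H_0 : \theta = 0 \quad\text{vs.}\quad H_1 : \theta \ne 0.$$
Since $X_i \sim C(\theta, 1)$, the joint density is given by
$$p_\theta(x) = \prod_{i=1}^n \frac{1}{\pi\,[1 + (x_i - \theta)^2]},$$
and also
$$\left.\frac{\partial \log p_\theta(x)}{\partial\theta}\right|_{\theta=0} = \left.\sum_{i=1}^n \frac{2(x_i - \theta)}{1 + (x_i - \theta)^2}\right|_{\theta=0} = \sum_{i=1}^n \frac{2x_i}{1 + x_i^2},$$
$$\left.\frac{\partial^2 \log p_\theta(x)}{\partial\theta^2}\right|_{\theta=0} = \left.2\sum_{i=1}^n \frac{-[1 + (x_i - \theta)^2] + 2(x_i - \theta)^2}{[1 + (x_i - \theta)^2]^2}\right|_{\theta=0} = 2\sum_{i=1}^n \frac{x_i^2 - 1}{(1 + x_i^2)^2}.$$

Therefore, the LMP test is given by
$$\varphi^0(x) = \begin{cases} 1 & \text{if } v > k_1 + k_2 u \\ 0 & \text{if } v < k_1 + k_2 u, \end{cases}$$
where
$$u = \sum_{i=1}^n \frac{2x_i}{1 + x_i^2} \quad\text{and}\quad v = 2\sum_{i=1}^n \frac{x_i^2 - 1}{(1 + x_i^2)^2} + u^2.$$
Also, $k_1, k_2$ are such that $E_{\theta=0}\varphi^0 = \alpha$ and $\beta'_{\varphi^0}(0) = 0$. Observe that $E_\theta\varphi = \int \varphi(x)\,p_\theta(x)\,dx$. Hence
$$\left.\frac{\partial}{\partial\theta} E_\theta\varphi^0\right|_{\theta=0} = \int \varphi^0(x) \left.\frac{\partial p_\theta(x)}{\partial\theta}\right|_{\theta=0} dx = E_0[\varphi^0 U],$$
and
$$\beta'_{\varphi^0}(0) = 0 \iff E_0[\varphi^0 U] = 0.$$
Now, since $(U, V)$ and $(-U, V)$ have the same distribution under $\theta = 0$,
$$\begin{aligned} E_0[\varphi^0 U] &= E_0[U\,I(V > k_1 + k_2 U)] \\ &= E_0[U\,I(U > 0, V > k_1 + k_2 U)] + E_0[U\,I(U < 0, V > k_1 + k_2 U)] \\ &= E_0[U\,I(U > 0, V > k_1 + k_2 U)] - E_0[U\,I(U > 0, V > k_1 - k_2 U)] \\ &= -E_0[U\,I(U > 0, k_1 - k_2 U < V < k_1 + k_2 U)]. \end{aligned}$$

Therefore, $E_0[\varphi^0 U] = 0 \Rightarrow k_2 = 0$. Hence the above test is given by
$$\varphi^0(x) = \begin{cases} 1 & \text{if } v > k_1 \\ 0 & \text{if } v < k_1, \end{cases}$$
where $k_1$ is such that $E_0\varphi^0 = \alpha$.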


Statistical Inference I 
UMPU tests for multi-parameter exponential family-I 


We have already discussed the construction of UMPU tests for distributions belonging to the one-parameter exponential family. In the multi-parameter exponential family more than one parameter is involved, and we are concerned with testing a hypothesis relating to one but not all of them, the other parameters being left unspecified by the hypothesis. Such parameters are called nuisance parameters. For example, in testing for the mean of a normal population, the unknown variance is a nuisance parameter. One method of overcoming the problem of nuisance parameters in hypothesis testing is to use similar tests, which have been discussed in Module-33. In the present module, we first re-introduce the concept of similar tests and then introduce another important concept, viz., tests with Neyman structure.

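Let $X = (X_1, \dots, X_n) \sim p_\theta(x)$ with $\theta = (\theta_1, \theta_2, \dots, \theta_k) \in \mathbb{R}^k$, $k \ge 2$. Our problem is to test
$$H : \theta \in \Theta_H \quad\text{vs.}\quad H_A : \theta \in \Theta_{H_A}\ (\subset \Theta - \Theta_H),$$
where $\Theta_B$ is the set of all points which are limit points of both $\Theta_H$ and $\Theta_{H_A}$, i.e. $\Theta_B = \Theta'_H \cap \Theta'_{H_A}$, where $A'$ denotes the derived set of $A$. It is the set of all points on the common boundary of $\Theta_H$ and $\Theta_{H_A}$.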
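Similar test:
A test $\varphi$ is called $\alpha$-similar on $\Theta_B$ if $E_\theta\varphi = \alpha$ for all $\theta \in \Theta_B$.

Example 1. Let $X_1, X_2, \dots, X_n \sim N(\mu, \sigma^2)$, where $\mu, \sigma^2$ are unknown. Here $\theta = (\theta_1, \theta_2) = (\mu, \sigma)$, $k = 2$. To test
$$H : \mu \le 0 \quad\text{vs.}\quad H_A : \mu > 0,$$
consider
$$\varphi(x) = \begin{cases} 1 & \text{if } \dfrac{\sqrt{n}\,\bar{x}}{s} > t_{\alpha, n-1} \\[2mm] 0 & \text{otherwise.} \end{cases}$$
Here,
$$\Theta_H = \{(\mu, \sigma) : \mu \le 0\}, \qquad \Theta_{H_A} = \{(\mu, \sigma) : \mu > 0\},$$
and
$$\Theta_B = \{(\mu, \sigma) : \mu = 0\}.$$
Note that on $\Theta_B$,
$$\frac{\sqrt{n}\,\bar{X}}{S} \sim t_{n-1}$$
and
$$E_\theta\varphi = P_\theta\left[\frac{\sqrt{n}\,\bar{X}}{S} > t_{\alpha, n-1}\right] = \alpha \quad \forall\,\theta \in \Theta_B.$$

Hence, $\varphi$ is $\alpha$-similar on $\Theta_B$.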


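Result 1. If the power function of a test $\varphi$ is continuous, then unbiasedness of a level $\alpha$ test $\varphi$ implies its $\alpha$-similarity.

Proof. Suppose $\theta^* \in \Theta_B$. Then there exists a sequence $\{\theta_n\}$, $\theta_n \in \Theta_H$, such that $\theta_n \to \theta^*$. Since $E_\theta(\varphi)$ is continuous, $E_{\theta_n}(\varphi) \to E_{\theta^*}(\varphi)$; and since $E_{\theta_n}(\varphi) \le \alpha$ for $\theta_n \in \Theta_H$, $E_{\theta^*}(\varphi) \le \alpha$. Similarly, there exists a sequence $\{\theta'_n\}$, $\theta'_n \in \Theta_{H_A}$, such that $E_{\theta'_n}(\varphi) \ge \alpha$ (since $\varphi$ is unbiased) and $\theta'_n \to \theta^*$. Thus $E_{\theta'_n}(\varphi) \to E_{\theta^*}(\varphi)$, and it follows that $E_{\theta^*}(\varphi) \ge \alpha$. Hence $E_{\theta^*}\varphi = \alpha$ for all $\theta^* \in \Theta_B$, and $\varphi$ is $\alpha$-similar on $\Theta_B$.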


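Result 2. If the power function of every test is continuous and a level $\alpha$ test $\varphi^0$ is UMP within the class of $\alpha$-similar tests on $\Theta_B$ for testing $H : \theta \in \Theta_H$ against $H_A : \theta \in \Theta_{H_A}$, then it is UMPU at level $\alpha$ for testing the same hypothesis.

Proof. Define
$$C_S = \{\varphi : E_\theta\varphi = \alpha\ \forall\,\theta \in \Theta_B\} = \text{class of all } \alpha\text{-similar tests on } \Theta_B,$$
and
$$C_U = \{\varphi : E_\theta\varphi \le \alpha\ \forall\,\theta \in \Theta_H,\ E_\theta\varphi \ge \alpha\ \forall\,\theta \in \Theta_{H_A}\} = \text{class of all unbiased level } \alpha \text{ tests}.$$

From Result 1 we get $C_U \subset C_S$. Let $\varphi^0$ be UMP within $C_S$. For every $\theta \in \Theta_B$ there exists a sequence $\theta_n \in \Theta_H$ with $\theta_n \to \theta$, since $\theta$ is a cluster point of $\Theta_H$, and by continuity
$$E_{\theta_n}\varphi^0 \to E_\theta\varphi^0 = \alpha.$$
Consider $\varphi^* \equiv \alpha$ for all $x$. Then
$$E_\theta\varphi^* = \alpha\ \forall\,\theta \in \Theta_B \Rightarrow \varphi^* \in C_S,$$
and, since $\varphi^0$ is UMP within $C_S$,
$$E_\theta\varphi^0 \ge E_\theta\varphi^* = \alpha \quad \forall\,\theta \in \Theta_{H_A}.$$
Thus the level $\alpha$ test $\varphi^0$ is unbiased, i.e. $\varphi^0 \in C_U$. Since $C_U \subset C_S$ and $\varphi^0$ is UMP within $C_S$, it is at least as powerful as every test in $C_U$.

Hence, $\varphi^0$ is a UMPU test.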


Suppose $T$ is sufficient for $\theta \in \Theta$. Then for any test $\varphi$,
$$\psi(t) = E[\varphi(X) \mid t]$$
is independent of $\theta$ and belongs to $[0, 1]$. Hence it would be natural to consider tests based on $T$ only instead of $X$. This fact motivates us to introduce a new concept, viz., tests with Neyman structure, which is defined as follows.

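Tests with Neyman structure:

Let $T$ be sufficient for $\Theta_B$. A test $\varphi$ is said to have Neyman structure at level $\alpha$ with respect to $T$ over $\Theta_B$ if
$$E(\varphi \mid T) = \alpha \quad\text{a.e. } \mathcal{P}^T_B,$$
where
$$\mathcal{P} = \{P_\theta;\ \theta \in \Theta\}, \qquad \mathcal{P}^T = \{P^T_\theta;\ \theta \in \Theta\},$$
and
$$\mathcal{P}^T_B = \{P^T_\theta;\ \theta \in \Theta_B\}.$$
That is, $E(\varphi \mid T) \ne \alpha$ only on a set $A$ with $P^T(A) = 0$ for every $P^T \in \mathcal{P}^T_B$.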
Result 3. If $\varphi$ has Neyman structure at level $\alpha$ with respect to $T$, then it is $\alpha$-similar.

Proof. Note that
$$E_\theta\varphi = E_\theta\,E[\varphi \mid T] = \alpha \quad \forall\,\theta \in \Theta_B.$$


The advantage of tests having Neyman structure is that on each of the surfaces $T = t$, the distribution of $X$ is independent of the parameter $\theta$. This provides a way of getting rid of the nuisance parameters. The following result provides an answer to the question "When will similar tests have Neyman structure?".

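Result 4. Let $T$ be a sufficient statistic for $\theta \in \Theta_B$. A necessary and sufficient condition for every $\alpha$-similar test to have Neyman structure at level $\alpha$ with respect to $T$ is the bounded completeness of $\mathcal{P}^T_B$.

Proof. (If part): $\varphi$ is $\alpha$-similar on $\Theta_B$
$$\Rightarrow E_\theta\varphi = \alpha \quad \forall\,\theta \in \Theta_B$$
$$\Rightarrow E_\theta[E(\varphi \mid T) - \alpha] = 0 \quad \forall\,\theta \in \Theta_B$$
$$\Rightarrow E(\varphi \mid T) = \alpha \quad\text{a.e. } \mathcal{P}^T_B,$$
since $\mathcal{P}^T_B$ is boundedly complete.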


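Proof. (Only if part): If possible, let $\mathcal{P}^T_B$ be not boundedly complete. Then there exists a bounded function $\psi(t)$ such that $E_\theta\psi(T) = 0$ for all $\theta \in \Theta_B$ but $P^T_\theta(\psi(T) \ne 0) > 0$ for some $P^T_\theta \in \mathcal{P}^T_B$. Since $\psi(T)$ is a bounded function there exists a positive real number $M$ such that $|\psi(T)| \le M < \infty$. Define
$$\varphi(t) = \alpha + c\,\psi(t), \qquad c = \frac{\min(\alpha, 1 - \alpha)}{M}.$$
Then $-cM + \alpha \le \varphi(t) \le cM + \alpha$.

Now
$$cM + \alpha = \begin{cases} 1 & \text{if } \alpha \ge \tfrac{1}{2} \\ 2\alpha & \text{if } \alpha < \tfrac{1}{2} \end{cases}$$
and
$$-cM + \alpha = \begin{cases} 2\alpha - 1 & \text{if } \alpha \ge \tfrac{1}{2} \\ 0 & \text{if } \alpha < \tfrac{1}{2}. \end{cases}$$

Thus $\varphi(t)$ is a test function, i.e. $0 \le \varphi(t) \le 1$.

Now
$$E_\theta\varphi(T) = c\,E_\theta\psi(T) + \alpha = \alpha \quad\text{for all } \theta \in \Theta_B.$$

Hence $\varphi(T)$ is $\alpha$-similar on $\Theta_B$.

But
$$E_\theta(\varphi(T) \mid T = t) = \varphi(t) = \alpha + c\,\psi(t) \ne \alpha \quad\text{with positive probability for some } P^T_\theta \in \mathcal{P}^T_B.$$

So $\varphi$ does not have Neyman structure, which is a contradiction. Hence $\mathcal{P}^T_B$ must be boundedly complete.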


Statistical Inference I 
UMPU tests for multi-parameter exponential family-II 


In the last module we have shown that if the power function of every test is continuous, then a UMP test $\varphi^0$ within the class of $\alpha$-similar tests is UMPU at level $\alpha$ for testing $H : \theta \in \Theta_H$ vs. $H_A : \theta \in \Theta_{H_A}$. In the present module we explore this idea for constructing UMPU tests when the distribution of $X$ belongs to the $k$-parameter exponential family.


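The p.m.f./p.d.f. of $X$ is given by
$$p_\theta(x) = A(\theta)\exp\left\{\sum_{i=1}^k \theta_i T_i(x)\right\} h(x),$$
where $\theta = (\theta_1, \dots, \theta_k) = (\theta_1, \theta^{(2)})$, $T = (T_1, \dots, T_k) = (T_1, T^{(2)})$, $k \ge 2$. Since we are dealing with the MPEF we can consider the following set of assumptions.

(i) The power function of any test $\varphi$ is continuous.

(ii) $\dfrac{\partial}{\partial\theta_i}\displaystyle\int \varphi\,p_\theta(x)\,d\mu = \int \varphi\,\dfrac{\partial}{\partial\theta_i}\,p_\theta(x)\,d\mu$.

(iii) The conditional distribution of $T_1 \mid T^{(2)} = t^{(2)}$ is
$$\frac{A(\theta)\exp\left\{\sum_{i=1}^k \theta_i t_i\right\} h(t)}{A(\theta)\exp\left\{\sum_{i=2}^k \theta_i t_i\right\}\int \exp\{\theta_1 t_1\}\,h(t)\,dt_1} = A(\theta_1, t^{(2)})\exp\{\theta_1 t_1\}\,H^*(t_1, t^{(2)}),$$
which is a One Parameter Exponential Family (OPEF).

(iv) $T^{(2)}$ is complete sufficient for $\mathcal{P}^T_B$ provided $\Theta_B$ contains a $(k - 1)$-dimensional rectangle.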
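First we consider the problem of testing
$$\text{(I)}\quad H : \theta_1 \le \theta_{10} \quad\text{vs.}\quad H_A : \theta_1 > \theta_{10}.$$

Since the distribution of $T_1 \mid T^{(2)} = t^{(2)}$ belongs to the OPEF, it possesses the MLR property. By the Karlin-Rubin theorem the UMP test for testing $H : \theta_1 = \theta_{10}$ vs. $H_A : \theta_1 > \theta_{10}$ at level $\alpha$, based on the conditional distribution of $T_1 \mid T^{(2)} = t^{(2)}$, is given by
$$\varphi^0(t_1, t^{(2)}) = \begin{cases} 1 & \text{if } t_1 > c(t^{(2)}) \\ \gamma(t^{(2)}) & \text{if } t_1 = c(t^{(2)}) \\ 0 & \text{if } t_1 < c(t^{(2)}), \end{cases}$$
where $c(t^{(2)})$ and $\gamma(t^{(2)})$ are such that $E_{\theta_{10}}[\varphi^0(T_1, T^{(2)}) \mid T^{(2)} = t^{(2)}] = \alpha$ a.e. $\mathcal{P}^{T^{(2)}}_B$.

Now we have to show that

(i) $\varphi^0(t_1, t^{(2)})$ is UMP within the class of all $\alpha$-similar tests on $\Theta_B = \{\theta : \theta_1 = \theta_{10}\}$;

(ii) $\varphi^0(t_1, t^{(2)})$ is a level $\alpha$ test for testing $H : \theta_1 \le \theta_{10}$ vs. $H_A : \theta_1 > \theta_{10}$.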
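Let us define
$$C_{NS} = \{\varphi : E(\varphi \mid T^{(2)}) = \alpha \ \text{a.e. } \mathcal{P}^{T^{(2)}}_B\},$$
$$C_S = \{\varphi : E_\theta\varphi = \alpha\ \forall\,\theta \in \Theta_B\},$$
and
$$C_U = \{\varphi : E_\theta\varphi \le \alpha\ \forall\,\theta \in \Theta_H,\ E_\theta\varphi \ge \alpha\ \forall\,\theta \in \Theta_{H_A}\}.$$

Now $\varphi^0 \in C_{NS}$ and hence $\varphi^0 \in C_S$. Let $\varphi$ be any $\alpha$-similar test on $\Theta_B$. Since $T^{(2)}$ is complete (hence boundedly complete) for $\Theta_B$, any $\alpha$-similar test has Neyman structure by Result 4, i.e.,
$$E(\varphi \mid T^{(2)}) = \alpha \quad\text{a.e. } \mathcal{P}^{T^{(2)}}_B.$$

Since $\varphi^0$ is conditionally UMP within $C_{NS}$,
$$E_\theta\varphi^0 = E_\theta\left[E_{\theta_1}(\varphi^0 \mid T^{(2)})\right] \ge E_\theta\left[E_{\theta_1}(\varphi \mid T^{(2)})\right] = E_\theta\varphi \quad \forall\,\theta \in \Theta_{H_A}.$$
So $\varphi^0$ is UMP within $C_S$. To prove (ii), note that by the MLR property $E_{\theta_1}(\varphi^0 \mid T^{(2)} = t^{(2)})$ is nondecreasing in $\theta_1$, so
$$E_{\theta_1}(\varphi^0 \mid T^{(2)} = t^{(2)}) \le E_{\theta_{10}}(\varphi^0 \mid T^{(2)} = t^{(2)}) = \alpha \quad \forall\,\theta_1 \le \theta_{10}$$
$$\Rightarrow E_\theta\varphi^0 = E_\theta\left[E_{\theta_1}(\varphi^0 \mid T^{(2)})\right] \le \alpha \quad \forall\,\theta \in \Theta_H,$$
so $\varphi^0$ is a level $\alpha$ test. Together with (i) and Result 2, $\varphi^0 \in C_U$ and $\varphi^0$ is UMPU.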


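Example 1. Let $X = (X_1, \dots, X_n)$ with $X_i \sim N(\mu, \sigma^2)$, $i = 1, 2, \dots, n$, independently, where $\mu, \sigma^2$ are both unknown. The joint density is
$$p(x; \mu, \sigma^2) = (2\pi)^{-n/2}\sigma^{-n}\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\right]. \qquad (1)$$

Consider the following hypotheses,
$$H : \mu \le 0 \quad\text{vs.}\quad H_A : \mu > 0.$$

We can write eqn (1) as
$$p(x \mid \mu, \sigma^2) = (2\pi)^{-n/2}\sigma^{-n}\exp\left(-\frac{n\mu^2}{2\sigma^2}\right)\exp\left\{\frac{n\mu}{\sigma^2}\,\bar{x} - \frac{1}{2\sigma^2}\sum_{i=1}^n x_i^2\right\} = A(\theta_1, \theta^{(2)})\exp\left\{\theta_1 T_1(x) + \theta^{(2)} T^{(2)}(x)\right\} h(x),$$
where $\theta_1 = \dfrac{n\mu}{\sigma^2}$, $\theta^{(2)} = -\dfrac{1}{2\sigma^2}$, $T_1(x) = \bar{x}$, and $T^{(2)}(x) = \sum_{i=1}^n x_i^2$. Hence $H$ can be rewritten as $H : \theta_1 \le 0$. Then $\Theta_B = \{(\theta_1, \theta^{(2)}) : \theta_1 = 0\}$. Note that $(T_1, T^{(2)})$ are jointly sufficient for $(\theta_1, \theta^{(2)}) \in \Theta$, so $T^{(2)}$ is sufficient for $\Theta_B$; it is also complete and hence boundedly complete. Thus the test given by
$$\varphi^0(t_1, t^{(2)}) = \begin{cases} 1 & \text{if } t_1 > c(t^{(2)}) \\ 0 & \text{if } t_1 < c(t^{(2)}) \end{cases}$$
with
$$E_{\theta_1 = 0}\left[\varphi^0(T_1, T^{(2)}) \mid T^{(2)} = t^{(2)}\right] = \alpha \quad\text{a.e. } \mathcal{P}^{T^{(2)}}_B$$
is the uniformly most powerful (UMP) $\alpha$-similar test. Define
$$V = \frac{T_1}{\sqrt{T^{(2)} - nT_1^2}} = \frac{\bar{X}}{\sqrt{\sum_{i=1}^n (X_i - \bar{X})^2}}.$$

Note that
$$\sqrt{n(n-1)}\;V = \frac{\sqrt{n}\,\bar{X}/\sigma}{\sqrt{\sum_{i=1}^n (X_i - \bar{X})^2 / ((n-1)\sigma^2)}} \sim t_{n-1} \quad\text{under } \Theta_B.$$

For fixed $t^{(2)}$, $V$ is an increasing function of $T_1$, and under $\Theta_B$ the distribution of $V$ does not depend on $\sigma$, so that $V$ is independent of $T^{(2)}$. Therefore,
$$\begin{aligned} P_{\theta_1 = 0}\left[T_1 > c(T^{(2)}) \mid T^{(2)} = t^{(2)}\right] &= P_{\theta_1 = 0}\left[V > c'(t^{(2)}) \mid T^{(2)} = t^{(2)}\right] \\ &= P_{\theta_1 = 0}\left[\sqrt{n(n-1)}\;V > t_{\alpha, n-1}\right] \\ &= \alpha. \end{aligned}$$

Hence the test
$$\varphi^0(x) = \begin{cases} 1 & \text{if } \dfrac{\sqrt{n}\,\bar{x}}{s} > t_{\alpha, n-1} \\[2mm] 0 & \text{otherwise,} \end{cases} \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2,$$
is a UMPU size $\alpha$ test.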

In the above example we see that to evaluate the cut-off points of the test we need the conditional distribution of $T_1 \mid T^{(2)} = t^{(2)}$, which may not be easy to obtain in all cases and may also be difficult to work with. So it is better if those conditional tests can be converted into unconditional tests by some means. It is possible to do so if there exists a suitable function $V = V(T_1, T^{(2)})$ of the sufficient statistics $(T_1, T^{(2)})$ which is independent of $T^{(2)}$. Our aim is to find a statistic $V$ such that

(i) $V$ is a strictly increasing function of $T_1$ for fixed $T^{(2)}$.

(ii) $V$ and $T^{(2)}$ are independent.

Then the conditionally best test is given by
$$\varphi(v, t^{(2)}) = \begin{cases} 1 & \text{if } v > v_0(t^{(2)}) \\ \gamma(t^{(2)}) & \text{if } v = v_0(t^{(2)}) \\ 0 & \text{if } v < v_0(t^{(2)}), \end{cases}$$
where $v_0(t^{(2)})$ and $\gamma(t^{(2)})$ are such that $E_{\theta_{10}}\left[\varphi(V, T^{(2)}) \mid T^{(2)} = t^{(2)}\right] = \alpha$.
Since $V$ and $T^{(2)}$ are independently distributed, the UMPU test is given by
$$\varphi^0(v) = \begin{cases} 1 & \text{if } v > v_0 \\ \gamma & \text{if } v = v_0 \\ 0 & \text{if } v < v_0, \end{cases}$$
where $v_0$ and $\gamma$ are such that $E_{\theta_{10}}[\varphi^0(V)] = \alpha$.

Choice of V

The statistic $V$ can be obtained by using the concept of ancillarity, which has already been discussed in Module-12. We know that a statistic $V$ is ancillary for $\mathcal{P}$ (or, for $\Theta^*$) if the distribution of $V$ is the same for every $P \in \mathcal{P}$ (or, every $\theta \in \Theta^*$).

Basu's theorem

Suppose $T$ is boundedly complete sufficient for $\Theta^*$ and $V$ is ancillary for $\Theta^*$. Then $V$ and $T$ are independent on $\Theta^*$.

In the last example, the distribution of $V(T_1, T^{(2)})$ is the same for every $\theta \in \Theta_B$ and $T^{(2)}$ is complete sufficient for $\Theta_B$. So $V(T_1, T^{(2)})$ and $T^{(2)}$ are independent.


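Example 2. Let $X = (X_1, \dots, X_n)$ with $X_i \sim N(\mu, \sigma^2)$, $i = 1, 2, \dots, n$, independently, where $\mu, \sigma^2$ are both unknown. The joint density is
$$p(x; \mu, \sigma^2) = (2\pi)^{-n/2}\sigma^{-n}\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\right].$$
Consider the following hypotheses,
$$H : \sigma^2 \le \sigma_0^2 \quad\text{vs.}\quad H_A : \sigma^2 > \sigma_0^2.$$
Similarly, in this case we can rewrite the density as
$$p(x \mid \mu, \sigma^2) = (2\pi)^{-n/2}\sigma^{-n}\exp\left(-\frac{n\mu^2}{2\sigma^2}\right)\exp\left\{\theta_1 T_1(x) + \theta^{(2)} T^{(2)}(x)\right\}\exp\left(-\frac{1}{2\sigma_0^2}\sum_{i=1}^n x_i^2\right),$$
where $\theta_1 = -\dfrac{1}{2}\left(\dfrac{1}{\sigma^2} - \dfrac{1}{\sigma_0^2}\right)$, $\theta^{(2)} = \dfrac{\mu}{\sigma^2}$, $T_1(x) = \sum_{i=1}^n x_i^2$, $T^{(2)}(x) = \sum_{i=1}^n x_i$. Hence $\Theta_B = \{(\theta_1, \theta^{(2)}) : \theta_1 = 0\}$, $H \Leftrightarrow H_0 : \theta_1 \le 0$ and $H_A \Leftrightarrow H_{0A} : \theta_1 > 0$. The UMP $\alpha$-similar test is given by
$$\varphi^0(t_1, t^{(2)}) = \begin{cases} 1 & \text{if } t_1 > c(t^{(2)}) \\ 0 & \text{if } t_1 < c(t^{(2)}) \end{cases}$$
with $E_{\theta_1 = 0}\left[\varphi^0(T_1, T^{(2)}) \mid T^{(2)} = t^{(2)}\right] = \alpha$. Under $\theta_1 = 0 \Leftrightarrow \sigma^2 = \sigma_0^2$, $T^{(2)} = \sum_{i=1}^n X_i$ is sufficient for $\theta^{(2)}$. It is also complete. Here the corresponding ancillary statistic is $V = \sum_{i=1}^n (X_i - \bar{X})^2$. [When $\theta_1 = 0$, the $X_i - \bar{X}$ are identically distributed and their distribution is free of all nuisance parameters.] By Basu's theorem, $V$ and $T^{(2)}$ are independently distributed whenever $\theta_1 = 0$. Define
$$V(T_1, T^{(2)}) = T_1 - \frac{(T^{(2)})^2}{n} = \sum_{i=1}^n (X_i - \bar{X})^2, \qquad \frac{V}{\sigma_0^2} \sim \chi^2_{n-1} \ \text{under } \Theta_B.$$
Hence,
$$\varphi^0(t_1, t^{(2)}) = \begin{cases} 1 & \text{if } \dfrac{(n-1)s^2}{\sigma_0^2} > \chi^2_{\alpha, n-1} \\[2mm] 0 & \text{if } \dfrac{(n-1)s^2}{\sigma_0^2} < \chi^2_{\alpha, n-1}. \end{cases}$$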


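Example 3. Let $X_1 \sim B(n_1, \theta_1)$ and $X_2 \sim B(n_2, \theta_2)$ be independent. Consider the following testing problem,
$$H : \theta_1 = \theta_2 \quad\text{vs.}\quad H_A : \theta_1 > \theta_2.$$

The joint p.m.f. is
$$\begin{aligned} p_\theta(x_1, x_2) &= \binom{n_1}{x_1}\binom{n_2}{x_2}\theta_1^{x_1}(1 - \theta_1)^{n_1 - x_1}\theta_2^{x_2}(1 - \theta_2)^{n_2 - x_2} \\ &= \binom{n_1}{x_1}\binom{n_2}{x_2}(1 - \theta_1)^{n_1}(1 - \theta_2)^{n_2}\exp\{x_1\tau + (x_1 + x_2)\theta\}, \end{aligned}$$
where $\tau = \log\dfrac{\theta_1}{1 - \theta_1} - \log\dfrac{\theta_2}{1 - \theta_2} = \log(\text{odds ratio})$ and $\theta = \log\dfrac{\theta_2}{1 - \theta_2}$. Define $S(X) = X_1$ and $T(X) = X_1 + X_2$. A UMP similar level $\alpha$ test is given by
$$\varphi(s, t) = \begin{cases} 1 & \text{if } s > c(t) \\ a(t) & \text{if } s = c(t) \\ 0 & \text{if } s < c(t), \end{cases}$$
where $c(t)$ and $a(t)$ are such that $E_{\tau = 0}[\varphi(S, T) \mid T = t] = \alpha$ for all $t$. Under $\tau = 0$, $S \mid T = t$ has the following (hypergeometric) p.m.f.,
$$P_{\tau = 0}(S = s \mid T = t) = \frac{\binom{n_1}{s}\binom{n_2}{t - s}}{\binom{n_1 + n_2}{t}}, \qquad \max(0, t - n_2) \le s \le \min(n_1, t).$$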

Statistical Inference I 
UMPU tests for multi-parameter exponential family-III 


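In the last module we have discussed the construction of the UMPU test for (I) $H : \theta_1 \le \theta_{10}$ vs. $H_A : \theta_1 > \theta_{10}$. In the present module we consider the UMPU test for testing (II) $H_0 : \theta_1 = \theta_{10}$ vs. $H_1 : \theta_1 \ne \theta_{10}$. Define $g_\theta(t) = c(\theta)\exp\left\{\sum_{i=1}^k \theta_i t_i\right\} H(t)$. So $T_1 \mid t^{(2)} \sim \text{OPEF}(\theta_1)$. Consider
$$\varphi^0(t) = \begin{cases} 1 & \text{if } t_1 < c_1(t^{(2)}) \text{ or } t_1 > c_2(t^{(2)}) \\ \gamma_i(t^{(2)}) & \text{if } t_1 = c_i(t^{(2)}),\ i = 1, 2 \\ 0 & \text{if } c_1(t^{(2)}) < t_1 < c_2(t^{(2)}), \end{cases}$$
where $c_1, c_2, \gamma_1, \gamma_2$ are such that
$$E_{\theta_{10}}[\varphi^0 \mid t^{(2)}] = \alpha, \qquad (1)$$
$$E_{\theta_{10}}[T_1\varphi^0 \mid t^{(2)}] = \alpha\,E_{\theta_{10}}[T_1 \mid t^{(2)}]. \qquad (2)$$

$\varphi^0$ is conditionally UMP (given $t^{(2)}$) in $C_U(\alpha, \theta_{10}) = \{\varphi : (1) \text{ and } (2) \text{ are satisfied}\}$.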
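To show that $\varphi^0$ is UMPU (unconditionally) at level $\alpha$, let $\varphi$ be any unbiased test for testing
$$H_0 : \theta_1 = \theta_{10} \quad\text{vs.}\quad H_1 : \theta_1 \ne \theta_{10}.$$

$\varphi$ is unbiased $\Rightarrow$ $\varphi$ is $\alpha$-similar on $\Theta_B$ (by Result 1)
$\Rightarrow$ $\varphi$ has Neyman structure with respect to $T^{(2)}$ over $\Theta_B$ (by Result 4)
$\Rightarrow$ $E_{\theta_{10}}[\varphi \mid t^{(2)}] = \alpha$ a.e.

Also, $\varphi$ is unbiased [i.e., $E_\theta\varphi$ has a minimum at $\theta_1 = \theta_{10}$]
$$\Rightarrow \left.\frac{\partial E_\theta\varphi}{\partial\theta_1}\right|_{\theta_1 = \theta_{10}} = 0.$$

Now,
$$E_\theta\varphi = \int \varphi(t)\,c(\theta)\exp\left\{\sum_{i=1}^k \theta_i t_i\right\} H(t)\,dt,$$
so that
$$\left.\frac{\partial E_\theta\varphi}{\partial\theta_1}\right|_{\theta_1 = \theta_{10}} = E_{\theta_{10}}[T_1\varphi] + \left.\frac{1}{c(\theta)}\frac{\partial c(\theta)}{\partial\theta_1}\right|_{\theta_1 = \theta_{10}} E_{\theta_{10}}\varphi = 0. \qquad (3)$$

Moreover, $\int c(\theta)\exp\left\{\sum_{i=1}^k \theta_i t_i\right\} H(t)\,dt = 1$. Differentiating with respect to $\theta_1$,
$$E_\theta T_1 + \frac{1}{c(\theta)}\frac{\partial c(\theta)}{\partial\theta_1} = 0, \quad\text{i.e.}\quad \frac{1}{c(\theta)}\frac{\partial c(\theta)}{\partial\theta_1} = -E_\theta T_1.$$
Substituting in (3) and using $E_{\theta_{10}}\varphi = \alpha$,
$$\begin{aligned} & E_{\theta_{10}}[T_1\varphi] - \alpha\,E_{\theta_{10}}T_1 = 0 \\ \Leftrightarrow\ & E_{\theta_{10}}\left[E_{\theta_{10}}\left(T_1\varphi - \alpha T_1 \mid T^{(2)} = t^{(2)}\right)\right] = 0 \\ \Leftrightarrow\ & E_{\theta_{10}}\left(T_1\varphi - \alpha T_1 \mid t^{(2)}\right) = 0 \ \text{a.e. (by completeness of } T^{(2)}) \\ \Leftrightarrow\ & E_{\theta_{10}}(T_1\varphi \mid t^{(2)}) = \alpha\,E_{\theta_{10}}(T_1 \mid t^{(2)}). \qquad (4) \end{aligned}$$
Hence $\varphi \in C_U(\alpha, \theta_{10})$. So,
$$E_\theta[\varphi^0 \mid t^{(2)}] \ge E_\theta[\varphi \mid t^{(2)}] \quad \forall\,\theta \text{ such that } \theta_1 \ne \theta_{10},$$
or,
$$E_\theta\varphi^0 \ge E_\theta\varphi \quad \forall\,\theta \text{ such that } \theta_1 \ne \theta_{10}.$$

But $\varphi$ is any unbiased level $\alpha$ test. Hence $\varphi^0$ is unconditionally UMPU at level $\alpha$ for testing
$$H_0 : \theta_1 = \theta_{10} \quad\text{vs.}\quad H_1 : \theta_1 \ne \theta_{10}.$$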

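Alternative representation.
Find a statistic $V = f(T_1, T^{(2)})$ such that

(i) $V$ and $T^{(2)}$ are independent under $\Theta_B = \{\theta : \theta_1 = \theta_{10}\}$;

(ii) $V = a(t^{(2)}) + T_1\,b(t^{(2)})$ for each fixed $t^{(2)}$, with $b(t^{(2)}) > 0$ for all $t^{(2)}$.

Therefore,
$$t_1 < c_1(t^{(2)}) \Leftrightarrow v < c'_1(t^{(2)})$$
and
$$t_1 > c_2(t^{(2)}) \Leftrightarrow v > c'_2(t^{(2)}).$$
Hence the alternative UMPU test is given as
$$\varphi^0(v) = \begin{cases} 1 & \text{if } v < c_1 \text{ or } v > c_2 \\ \gamma_i & \text{if } v = c_i,\ i = 1, 2 \\ 0 & \text{if } c_1 < v < c_2, \end{cases}$$
where $c_1, c_2, \gamma_1, \gamma_2$ are such that
$$E_{\theta_{10}}\varphi^0(V) = \alpha$$
and
$$E_{\theta_{10}}[V\varphi^0(V)] = \alpha\,E_{\theta_{10}}V.$$

Choice of V

(i) $V$ is ancillary for $\Theta_B$.

(ii) $V = a(t^{(2)}) + T_1\,b(t^{(2)})$, i.e. $V$ is a linear function of $T_1$.

Note:

If $V$ is symmetric about $a$ under $\Theta_B$, the UMPU test can be given by
$$\varphi^0(v) = \begin{cases} 1 & \text{if } |v - a| > c \\ \gamma & \text{if } |v - a| = c \\ 0 & \text{if } |v - a| < c, \end{cases}$$
where $E_{\theta_{10}}\varphi^0(V) = \alpha$.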


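Example:

Ex 1. Let $X_1, \dots, X_n \sim N(\mu, \sigma^2)$. We are interested in testing
$$\text{(i)}\quad H_0 : \sigma = \sigma_0 \quad\text{vs.}\quad H_1 : \sigma \ne \sigma_0$$
and
$$\text{(ii)}\quad H_0 : \mu = 0 \quad\text{vs.}\quad H_1 : \mu \ne 0.$$

Solution (i)
$$p_\theta(x) = (2\pi)^{-n/2}\sigma^{-n}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\right\}.$$

Let us choose
$$\theta_1 = -\frac{1}{2\sigma^2} \quad\text{and}\quad T_1 = \sum_{i=1}^n X_i^2,$$
and
$$\theta^{(2)} = \frac{\mu}{\sigma^2} \quad\text{and}\quad T^{(2)} = n\bar{X}.$$
Define
$$V = T_1 - \frac{(T^{(2)})^2}{n} = \sum_{i=1}^n (X_i - \bar{X})^2,$$
where $a(t^{(2)}) = -\dfrac{(t^{(2)})^2}{n}$ and $b(t^{(2)}) = 1$, so that $V$ is a linear function of $T_1$ for each fixed $t^{(2)}$; also $V$ and $T^{(2)}$ are independent. Then the UMPU test is given by
$$\varphi^0(v) = \begin{cases} 1 & \text{if } \dfrac{v}{\sigma_0^2} < c_1 \text{ or } \dfrac{v}{\sigma_0^2} > c_2 \\[2mm] 0 & \text{if } c_1 < \dfrac{v}{\sigma_0^2} < c_2, \end{cases}$$
where $c_1$ and $c_2$ are such that
$$E_{\sigma_0}\varphi^0(V) = \alpha$$
and
$$E_{\sigma_0}[V\varphi^0(V)] = \alpha\,E_{\sigma_0}V.$$
Thus,
$$P\left[c_1 < \chi^2_{n-1} < c_2\right] = 1 - \alpha$$
and
$$P\left[c_1 < \chi^2_{n+1} < c_2\right] = 1 - \alpha.$$

Solution (ii) Let us now re-define
$$\theta_1 = \frac{\mu}{\sigma^2} \quad\text{and}\quad T_1 = n\bar{X},$$
and
$$\theta^{(2)} = -\frac{1}{2\sigma^2} \quad\text{and}\quad T^{(2)} = \sum_{i=1}^n X_i^2.$$
The statistic
$$V = \frac{T_1/\sqrt{n}}{\sqrt{\left(T^{(2)} - \dfrac{T_1^2}{n}\right)\Big/(n - 1)}} = \frac{\sqrt{n}\,\bar{X}}{S}$$
is ancillary for $\Theta_B = \{\theta : \mu = 0\}$, but it is not linear in $T_1$. Define $W = \dfrac{T_1}{\sqrt{T^{(2)}}}$, which is ancillary for $\Theta_B$ since its distribution is independent of $\sigma$; it is linear in $T_1$ for each fixed $t^{(2)}$, and $W$ is symmetric about 0. The UMPU test is given by
$$\varphi^0(w) = \begin{cases} 1 & \text{if } |w| > c \\ 0 & \text{if } |w| < c, \end{cases}$$
where $E_{\mu = 0}\varphi^0(W) = \alpha$.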


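Ex 2. Let $(X, Y) \sim N_2(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$ and let $(X_1, Y_1), \dots, (X_n, Y_n)$ be an iid sample from $(X, Y)$. We are interested in finding a UMPU test for
$$\text{(i)}\quad H_0 : \rho \le 0 \quad\text{vs.}\quad H_1 : \rho > 0$$
and
$$\text{(ii)}\quad H_0 : \rho = 0 \quad\text{vs.}\quad H_1 : \rho \ne 0.$$
Solution. Let $\theta = (\mu_1, \mu_2, \sigma_1, \sigma_2, \rho)$ and $\Theta_B = \{\theta : \rho = 0\}$. Now the joint density is given as
$$\begin{aligned} p(x, y) &= (2\pi)^{-n}\left(\sigma_1\sigma_2\sqrt{1 - \rho^2}\right)^{-n} \\ &\quad\times \exp\left\{-\frac{1}{2(1 - \rho^2)}\left[\sum_{i=1}^n \left(\frac{x_i - \mu_1}{\sigma_1}\right)^2 - \frac{2\rho}{\sigma_1\sigma_2}\sum_{i=1}^n (x_i - \mu_1)(y_i - \mu_2) + \sum_{i=1}^n \left(\frac{y_i - \mu_2}{\sigma_2}\right)^2\right]\right\} \\ &= f(\theta)\exp\left\{\theta_1\sum x_i y_i + \theta_2\sum x_i^2 + \theta_3\sum y_i^2 + \theta_4\,\bar{x} + \theta_5\,\bar{y}\right\}. \end{aligned}$$
So let us now define $\theta_1 = \dfrac{\rho}{(1 - \rho^2)\sigma_1\sigma_2}$ and $T_1 = \sum_{i=1}^n X_i Y_i$. Also $T^{(2)} = \left(\sum X_i^2, \sum Y_i^2, \bar{X}, \bar{Y}\right)$.

So accordingly, we can define
$$V = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^n (X_i - \bar{X})^2\,\sum_{i=1}^n (Y_i - \bar{Y})^2}} = \frac{T_1 - g_1(T^{(2)})}{g_2(T^{(2)})},$$
which is the sample correlation coefficient; it is ancillary for $\Theta_B$ and linear in $t_1$ for each fixed $t^{(2)}$. Hence $V$ and $T^{(2)}$ are independent under $\Theta_B$, and the UMPU test for
$$H_0 : \rho \le 0 \quad\text{vs.}\quad H_1 : \rho > 0$$
is given by
$$\varphi^0(v) = \begin{cases} 1 & \text{if } \dfrac{\sqrt{n - 2}\;v}{\sqrt{1 - v^2}} > t_{\alpha, n-2} \\[2mm] 0 & \text{if } \dfrac{\sqrt{n - 2}\;v}{\sqrt{1 - v^2}} < t_{\alpha, n-2}, \end{cases}$$
and the UMPU test for
$$H_0 : \rho = 0 \quad\text{vs.}\quad H_1 : \rho \ne 0$$
is given by
$$\varphi^0(v) = \begin{cases} 1 & \text{if } \left|\dfrac{\sqrt{n - 2}\;v}{\sqrt{1 - v^2}}\right| > t_{\alpha/2, n-2} \\[2mm] 0 & \text{otherwise,} \end{cases}$$
since $V$ is symmetric about 0 under $\Theta_B$.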


Statistical Inference I 
Theory of Confidence Sets 


Introduction 

The idea of interval estimation can be extended to include simultaneous estimation of several parameters. For example, the two parameters $(\mu, \sigma^2)$ of a $N(\mu, \sigma^2)$ distribution may be estimated by some subset $S$ of the parameter space $\Theta = \mathbb{R} \times \mathbb{R}^+$. A $100(1 - \alpha)\%$ confidence region is a subset of $\Theta$ such that, if samples were repeatedly drawn from the population and a region constructed for each sample, then about $100(1 - \alpha)\%$ of those regions would include the true parameter value. So our discussion can be made somewhat more general if we speak in terms of confidence sets rather than confidence intervals.
Confidence set estimation 

Let $\theta$ be the parameter of interest and $\Theta$ the parameter space. Let $X$ be the random vector with support $\mathcal{X}$, and let $P_\theta$ be the probability distribution of $X$.

Definition. (Confidence set)

A family of subsets $S(x)$ of $\Theta$ is said to constitute a family of confidence sets for $\theta$ at confidence level $(1 - \alpha)$ if
$$P_\theta[\theta \in S(X)] \ge 1 - \alpha \quad \forall\,\theta \in \Theta.$$


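Special cases:
I. Lower confidence interval. A lower confidence bound provides just a lower bound for the parameter of interest:
$$P_\theta[\theta \ge \underline{\theta}(X)] \ge 1 - \alpha \quad \forall\,\theta \in \Theta.$$
Confidence set $= [\underline{\theta}(x), \infty)$ and $S(x) = \{\theta : \underline{\theta}(x) \le \theta\}$.

Example 1. Let $X_1, \dots, X_n \sim N(\mu, 1)$. Since we know $\bar{X} \sim N(\mu, \tfrac{1}{n})$,
$$P_\mu\left[\sqrt{n}(\bar{X} - \mu) \le \tau_\alpha\right] = 1 - \alpha \quad \forall\,\mu \in \mathbb{R}.$$
Hence, by interchanging sides,
$$P_\mu\left[\mu \ge \bar{X} - \frac{\tau_\alpha}{\sqrt{n}}\right] = 1 - \alpha \quad \forall\,\mu \in \mathbb{R}.$$

II. Upper confidence interval. An upper confidence bound provides just an upper bound for the parameter of interest:
$$P_\theta[\theta \le \overline{\theta}(X)] \ge 1 - \alpha \quad \forall\,\theta \in \Theta.$$
Confidence set $= (-\infty, \overline{\theta}(x)]$ and $S(x) = \{\theta : \overline{\theta}(x) \ge \theta\}$.

III. Confidence interval. A confidence interval provides both upper and lower bounds for the parameter of interest:
$$P_\theta[\underline{\theta}(X) \le \theta \le \overline{\theta}(X)] \ge 1 - \alpha \quad \forall\,\theta \in \Theta.$$
Confidence set $= [\underline{\theta}(x), \overline{\theta}(x)]$ and $S(x) = \{\theta : \underline{\theta}(x) \le \theta \le \overline{\theta}(x)\}$.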
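
The three special cases can be made concrete for a $N(\mu, 1)$ sample. The following is a minimal sketch, assuming known unit variance and using only Python's standard library; the function name and the dictionary keys are hypothetical choices made for illustration.

```python
from math import sqrt
from statistics import NormalDist, fmean

def normal_mean_bounds(sample, alpha):
    """Lower bound, upper bound and two-sided interval for mu when X_i ~ N(mu, 1)."""
    n = len(sample)
    xbar = fmean(sample)
    std_normal = NormalDist()
    tau_a = std_normal.inv_cdf(1 - alpha)       # upper alpha point
    tau_a2 = std_normal.inv_cdf(1 - alpha / 2)  # upper alpha/2 point
    return {
        "lower": xbar - tau_a / sqrt(n),   # confidence set [lower, infinity)
        "upper": xbar + tau_a / sqrt(n),   # confidence set (-infinity, upper]
        "two_sided": (xbar - tau_a2 / sqrt(n), xbar + tau_a2 / sqrt(n)),
    }

print(normal_mean_bounds([0.5, -0.5, 1.0, -1.0], 0.05))
```

Each one-sided bound spends the whole error probability $\alpha$ on one side, which is why it lies closer to $\bar{x}$ than the corresponding endpoint of the two-sided interval.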


Relation between confidence set estimation and testing of hypothesis. 
There is a very strong correspondence between hypothesis testing and interval es- 
timation. In fact we can say in general that every confidence set corresponds to a 
test and vice-versa. We next consider the method of test inversion and explore the 
relationship between a test of hypothesis for a parameter 0 and confidence interval 
for 0. 

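Result: For each $\theta_0 \in \Theta$, let $A(\theta_0)\ (\subset \mathcal{X})$ be the acceptance region of a level $\alpha$ test for testing $H_0(\theta_0) : \theta = \theta_0$. Then
$$S(x) = \{\theta : x \in A(\theta)\}$$
is a confidence set for $\theta$ at confidence level $1 - \alpha$.

Proof. Let $R(\theta_0)$ be the rejection region of the level $\alpha$ test mentioned above for testing $H_0(\theta_0) : \theta = \theta_0$. We know that $A(\theta_0) \cup R(\theta_0) = \mathcal{X}$ (sample space). Then
$$\begin{aligned} & P_{\theta_0}[X \in R(\theta_0)] \le \alpha \quad \forall\,\theta_0 \in \Theta \\ \text{or, } & P_\theta[X \in R(\theta)] \le \alpha \quad \forall\,\theta \in \Theta \\ \text{or, } & P_\theta[X \in A(\theta)] \ge 1 - \alpha \quad \forall\,\theta \in \Theta \\ \text{or, } & P_\theta[\theta \in S(X)] \ge 1 - \alpha \quad \forall\,\theta \in \Theta. \end{aligned}$$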


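Example. Let $X = (X_1, \dots, X_n)$ with $X_i \sim N(\mu, \sigma^2)$. Suppose we are interested in the following testing problem,
$$H_0 : \mu = \mu_0 \quad\text{vs.}\quad H_1 : \mu \ne \mu_0.$$

We know that we reject $H_0$ at level $\alpha$ if $\dfrac{\sqrt{n}\,|\bar{x} - \mu_0|}{s} > t_{\alpha/2, n-1}$. Therefore,
$$R(\mu_0) = \left\{x : \frac{\sqrt{n}\,|\bar{x} - \mu_0|}{s} > t_{\alpha/2, n-1}\right\},$$
$$A(\mu) = \left\{x : \frac{\sqrt{n}\,|\bar{x} - \mu|}{s} \le t_{\alpha/2, n-1}\right\} = \{x : \mu \in S(x)\},$$
where $S(x) = \{\mu : x \in A(\mu)\}$. Hence,
$$A(\mu) = \left\{x : \bar{x} - \frac{s}{\sqrt{n}}\,t_{\alpha/2, n-1} \le \mu \le \bar{x} + \frac{s}{\sqrt{n}}\,t_{\alpha/2, n-1}\right\},$$
and
$$S(x) = \left[\bar{x} - \frac{s}{\sqrt{n}}\,t_{\alpha/2, n-1},\ \bar{x} + \frac{s}{\sqrt{n}}\,t_{\alpha/2, n-1}\right].$$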


Confidence interval for discrete case 

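Let $X = (X_1, \dots, X_n)$ be a random sample of size $n$ drawn from a discrete distribution with p.m.f. $p(x, \theta)$. To determine a $100(1 - \alpha)\%$ confidence interval for $\theta$ we may use the distribution of the sufficient statistic $T$ for $\theta$. We consider the following result without proving it. For a proof of the result one may see Casella and Berger.

Result. Let $T$ be a discrete statistic with c.d.f. $F_\theta(t) = P_\theta(T \le t)$. Let $\alpha_1$ and $\alpha_2$ be two fixed real numbers such that $\alpha_1 + \alpha_2 = \alpha$. Suppose that for each $t$, $\underline{\theta}(t)$ and $\overline{\theta}(t)$ are defined as follows:

(i) If $F_\theta(t)$ is a decreasing function of $\theta$ for each $t$, define $\underline{\theta}(t)$ and $\overline{\theta}(t)$ by
$$P_{\overline{\theta}(t)}(T \le t) = \alpha_1 \quad\text{and}\quad P_{\underline{\theta}(t)}(T \ge t) = \alpha_2.$$

(ii) If $F_\theta(t)$ is an increasing function of $\theta$ for each $t$, define $\underline{\theta}(t)$ and $\overline{\theta}(t)$ by
$$P_{\overline{\theta}(t)}(T \ge t) = \alpha_1 \quad\text{and}\quad P_{\underline{\theta}(t)}(T \le t) = \alpha_2.$$

Then the interval $[\underline{\theta}(T), \overline{\theta}(T)]$ is a confidence interval for $\theta$ with confidence coefficient $(1 - \alpha)$.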


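Example. Let $X_1, X_2, \dots, X_n$ be a random sample of size $n$ from the Poisson$(\theta)$ distribution. Here $T = \sum_{i=1}^n X_i$ is sufficient for $\theta$. Since $T \sim$ Poisson$(n\theta)$, the c.d.f. of $T$ is a decreasing function of $\theta$. If $T = t$ is observed, taking $\alpha_1 = \alpha_2 = \frac{\alpha}{2}$, we have to solve $\underline{\theta}(t)$ and $\overline{\theta}(t)$ from the equations
$$\sum_{k=0}^{t} \frac{e^{-n\overline{\theta}(t)}\,(n\overline{\theta}(t))^k}{k!} = \frac{\alpha}{2} \quad\text{and}\quad \sum_{k=t}^{\infty} \frac{e^{-n\underline{\theta}(t)}\,(n\underline{\theta}(t))^k}{k!} = \frac{\alpha}{2}.$$

Now we know that if $X \sim$ Poisson$(\lambda)$ then for any non-negative integer $k$,
$$P(X \le k) = P(Y > 2\lambda), \quad\text{where } Y \sim \chi^2_{2(k+1)}.$$

Hence
$$P\left(\chi^2_{2(t+1)} > 2n\overline{\theta}(t)\right) = \frac{\alpha}{2} \Rightarrow 2n\overline{\theta}(t) = \chi^2_{2(t+1), \alpha/2} \Rightarrow \overline{\theta}(t) = \frac{1}{2n}\,\chi^2_{2(t+1), \alpha/2},$$
where $\chi^2_{\nu, p}$ denotes the upper $p$ point of the $\chi^2_\nu$ distribution.

Again
$$\sum_{k=t}^{\infty} \frac{e^{-n\underline{\theta}(t)}\,(n\underline{\theta}(t))^k}{k!} = \frac{\alpha}{2} \Rightarrow P\left(\chi^2_{2t} \le 2n\underline{\theta}(t)\right) = \frac{\alpha}{2} \Rightarrow 2n\underline{\theta}(t) = \chi^2_{2t, 1-\alpha/2} \Rightarrow \underline{\theta}(t) = \frac{1}{2n}\,\chi^2_{2t, 1-\alpha/2}.$$

Hence $\left[\dfrac{1}{2n}\,\chi^2_{2t, 1-\alpha/2},\ \dfrac{1}{2n}\,\chi^2_{2(t+1), \alpha/2}\right]$ is a confidence interval for $\theta$ with confidence coefficient $(1 - \alpha)$.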
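
The two defining equations of the Poisson interval can also be solved directly, without chi-square tables. The following is a minimal sketch using bisection on the Poisson c.d.f. with only the standard library; the helper names and the bracketing bound `hi_lam` are hypothetical choices, adequate for moderate $t$ and usual values of $\alpha$, and not part of the original notes.

```python
import math

def poisson_cdf(t, lam):
    """P(T <= t) for T ~ Poisson(lam); returns 0 for t < 0."""
    if t < 0:
        return 0.0
    term = math.exp(-lam)
    total = term
    for k in range(1, t + 1):
        term *= lam / k
        total += term
    return total

def solve_decreasing(f, target, lo, hi, iters=200):
    """Bisection for f(x) = target when f is decreasing on [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def poisson_interval(t, n, alpha):
    """Equal-tailed interval for theta from T = t, where T ~ Poisson(n * theta)."""
    hi_lam = t + 10.0 * math.sqrt(t + 1.0) + 50.0  # assumed bracket for n * theta
    # Upper limit: P(T <= t) = alpha / 2.
    upper = solve_decreasing(lambda lam: poisson_cdf(t, lam), alpha / 2, 0.0, hi_lam)
    if t == 0:
        lower = 0.0
    else:
        # Lower limit: P(T >= t) = alpha / 2, i.e. P(T <= t - 1) = 1 - alpha / 2.
        lower = solve_decreasing(lambda lam: poisson_cdf(t - 1, lam), 1 - alpha / 2, 0.0, hi_lam)
    return lower / n, upper / n

print(poisson_interval(3, 1, 0.05))
```

For $t = 0$ the lower equation has no positive solution, so the sketch sets the lower limit to 0, which agrees with the convention usually adopted for this interval.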


Optimum property of confidence sets 
Since there is a one-to-one correspondence between confidence sets and test of hy- 
potheses, there is some correspondence between optimality of tests and optimality of 
confidence sets. Usually test related optimality properties of confidence sets do not 
directly relate to the size of the set but rather to the probability of the set covering 
false values. The probability of covering false values, or the probability of false coverage, indirectly measures the size of the confidence set. Intuitively, smaller sets cover fewer values and, hence, are less likely to cover false values. An optimum property of


confidence sets is given in the following definition. 


Uniformly most accurate confidence set (UMA) 


A family of confidence sets $S^0(x)$, $x \in \mathcal{X}$, is said to constitute a family of UMA confidence sets at confidence level $(1 - \alpha)$ if

(i) $P_\theta(\theta \in S^0(X)) \ge 1 - \alpha \quad \forall\,\theta \in \Theta$

(ii) $P_\theta(\theta' \in S^0(X)) \le P_\theta(\theta' \in S(X)) \quad \forall\,\theta' \ne \theta$

whatever be $S(X)$ satisfying (i).

A $1 - \alpha$ confidence set that minimizes the probability of false coverage over a class of $1 - \alpha$ confidence sets is called a UMA set. UMA confidence sets are constructed by inverting the acceptance region of a UMP test, as we will prove below.


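Result: For each $\theta_0 \in \Theta$, let $A^0(\theta_0)$ be the acceptance region of a UMP level $\alpha$ test for testing $H_0(\theta_0) : \theta = \theta_0$. Then
$$S^0(x) = \{\theta : x \in A^0(\theta)\}$$
is UMA at confidence level $(1 - \alpha)$.

Proof. Let $R^0(\theta_0)$ be the rejection region of the UMP test mentioned above, for every $\theta_0 \in \Theta$. By the previous result,
$$P_\theta(\theta \in S^0(X)) \ge 1 - \alpha \quad \forall\,\theta \in \Theta.$$
Let $S(x)$ be any other family of confidence sets at level $1 - \alpha$, and write $A(\theta_0) = \{x : \theta_0 \in S(x)\}$ with complement $R(\theta_0)$, so that $R(\theta_0)$ is the rejection region of a level $\alpha$ test of $H_0(\theta_0)$. Since the test with rejection region $R^0(\theta_0)$ is UMP,
$$\begin{aligned} & P_\theta[X \in R^0(\theta_0)] \ge P_\theta[X \in R(\theta_0)] \quad \forall\,\theta \ne \theta_0 \\ \Rightarrow\ & P_\theta[X \in A^0(\theta_0)] \le P_\theta[X \in A(\theta_0)] \quad \forall\,\theta \ne \theta_0 \\ \Rightarrow\ & P_\theta[\theta_0 \in S^0(X)] \le P_\theta[\theta_0 \in S(X)] \quad \forall\,\theta \ne \theta_0, \end{aligned}$$
whatever be $S(x)$ satisfying $P_\theta(\theta \in S(X)) \ge 1 - \alpha$.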


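Example 1. Let $X = (X_1, \dots, X_n)$ with $X_i \sim N(\theta, 1)$. The rejection region of a UMP test of size $\alpha$ for testing
$$H_0 : \theta = \theta_0 \quad\text{vs.}\quad H_1 : \theta > \theta_0$$
is $\sqrt{n}(\bar{x} - \theta_0) > \tau_\alpha$, where $\tau_\alpha$ is the upper $\alpha$ point of the standard normal distribution. Define the acceptance region
$$A(\theta_0) = \left\{x : \bar{x} \le \theta_0 + \frac{\tau_\alpha}{\sqrt{n}}\right\}$$
and
$$S(x) = \left\{\theta : \theta \ge \bar{x} - \frac{\tau_\alpha}{\sqrt{n}}\right\}.$$

Hence the UMA confidence interval is given by
$$\left[\bar{x} - \frac{\tau_\alpha}{\sqrt{n}},\ \infty\right).$$

Note. The more common two-sided interval $\left[\bar{x} - \dfrac{\tau_{\alpha/2}}{\sqrt{n}},\ \bar{x} + \dfrac{\tau_{\alpha/2}}{\sqrt{n}}\right]$ is not UMA, since it is obtained by inverting the acceptance region of the test of $H : \theta = \theta_0$ against $H_A : \theta \ne \theta_0$, for which no UMP test exists.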


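Example 2. Let $X = (X_1, \dots, X_n)$ with $X_i \sim R(0, \theta)$. Consider the UMP test for testing
$$H_0 : \theta = \theta_0 \quad\text{vs.}\quad H_1 : \theta \ne \theta_0.$$
We reject $H_0(\theta_0)$ if $x_{(n)} > \theta_0$ or $x_{(n)} < \theta_0\,\alpha^{1/n}$, and define the acceptance region
$$A(\theta_0) = \left\{x : \theta_0\,\alpha^{1/n} \le x_{(n)} \le \theta_0\right\}$$
and
$$S(x) = \left\{\theta : x_{(n)} \le \theta \le \frac{x_{(n)}}{\alpha^{1/n}}\right\}.$$

Hence the UMA confidence interval is given by
$$\left[x_{(n)},\ \frac{x_{(n)}}{\alpha^{1/n}}\right].$$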
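
The interval for the uniform scale parameter is simple enough to compute and to check by simulation. The following is a minimal sketch, assuming $X_i \sim R(0, \theta)$ and the interval $[x_{(n)},\ x_{(n)}\,\alpha^{-1/n}]$; the function names, the seed and the number of replications are hypothetical choices for illustration only.

```python
import random

def uniform_uma_interval(sample, alpha):
    """Interval [x_(n), x_(n) * alpha**(-1/n)] for theta when X_i ~ R(0, theta)."""
    n = len(sample)
    x_max = max(sample)
    return x_max, x_max * alpha ** (-1.0 / n)

def coverage(theta, n, alpha, reps, seed):
    """Monte Carlo estimate of P_theta(theta in S(X))."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.uniform(0.0, theta) for _ in range(n)]
        lo, hi = uniform_uma_interval(sample, alpha)
        hits += lo <= theta <= hi
    return hits / reps

print(uniform_uma_interval([1.0, 2.0, 4.0], 0.125))
print(coverage(3.0, 5, 0.05, 20000, 1))
```

The simulated coverage should be close to $1 - \alpha$, in line with $P_\theta(X_{(n)} \ge \theta\,\alpha^{1/n}) = 1 - \alpha$.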


Statistical Inference I 
Theory of Unbiased Confidence Sets 


In the last lecture we have studied test inversion as one of the methods of constructing 
confidence intervals. We showed that UMP tests lead to UMA confidence intervals. 
In Module 32 we have seen that UMP tests generally do not exist for two-sided al- 
ternatives. In such situations we either restrict consideration to smaller subclasses of 
tests by requiring that the test functions have some desirable properties, or we restrict 
the class of alternatives to those near the null parameter values. The construction of 
unbiased tests and locally most powerful tests were already discussed. In this lecture 
we discuss how the concept of unbiasedness can be used in constructing confidence 


intervals. 


Unbiased confidence set 
A family of confidence sets $S(x)$, $x \in \mathcal{X}$, for a parameter $\theta$ is said to be unbiased at confidence level $(1 - \alpha)$ if

(i) $P_\theta(\theta \in S(X)) \ge 1 - \alpha \quad \forall\,\theta \in \Theta$

(ii) $P_\theta(\theta' \in S(X)) \le 1 - \alpha \quad \forall\,\theta' \ne \theta$.

If $S(x)$ is an interval satisfying (i) and (ii), we call it a $(1 - \alpha)$-level unbiased confidence interval.

Uniformly most accurate unbiased (UMAU) confidence set:
A family of confidence sets $S^0(x)$, $x \in \mathcal{X}$, for a parameter $\theta$ is said to constitute a family of UMAU confidence sets at confidence level $(1 - \alpha)$ if it is unbiased at confidence level $(1 - \alpha)$ and
$$P_\theta(\theta' \in S^0(X)) \le P_\theta(\theta' \in S(X)) \quad \forall\,\theta' \ne \theta,\ \theta' \in \Theta,$$
where $S(X)$ is any other unbiased confidence set at confidence level $(1 - \alpha)$.

Remark. A family $S(X)$ of confidence sets for a parameter $\theta$ is unbiased at level $1 - \alpha$ if the probability of true coverage is at least $1 - \alpha$ and that of false coverage is at most $1 - \alpha$. In other words, $S(X)$ traps a true parameter value more often than it does a false one.


Result 1: For each $\theta_0 \in \Theta$, let $A^0(\theta_0)$ be the acceptance region of a UMPU level $\alpha$ test for testing $H(\theta_0) : \theta = \theta_0$ against $H_A(\theta_0) : \theta \ne \theta_0$. Then
$$S^0(x) = \{\theta : x \in A^0(\theta)\}$$
is UMAU at confidence level $(1 - \alpha)$.

Proof. To see that $S^0(x)$ is unbiased, we note that since $A^0(\theta')$ is the acceptance region of an unbiased test,
$$P_\theta(\theta' \in S^0(X)) = P_\theta[X \in A^0(\theta')] \le 1 - \alpha \quad \forall\,\theta' \ne \theta.$$

We next show that $S^0(X)$ is UMA. Let $S(x)$ be any other unbiased $(1 - \alpha)$-level family of confidence sets, and write $A(\theta) = \{x : \theta \in S(x)\}$. Then $P_\theta[X \in A(\theta')] = P_\theta(\theta' \in S(X)) \le 1 - \alpha$ for $\theta' \ne \theta$, and it follows that $A(\theta)$ is the acceptance region of an unbiased size $\alpha$ test. Hence
$$P_\theta(\theta' \in S(X)) = P_\theta[X \in A(\theta')] \ge P_\theta[X \in A^0(\theta')] = P_\theta[\theta' \in S^0(X)] \quad \forall\,\theta' \ne \theta.$$

The inequality follows since $A^0(\theta')$ is the acceptance region of a UMPU test. This completes the proof.


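Example 1. Let $X = (X_1, \dots, X_n)$ with $X_i \sim N(\theta, 1)$. The rejection region of a UMPU size $\alpha$ test for testing
$$H_0 : \theta = \theta_0 \quad\text{vs.}\quad H_1 : \theta \ne \theta_0$$
is $\sqrt{n}\,|\bar{x} - \theta_0| > \tau_{\alpha/2}$, where $\tau_\alpha$ is the upper $\alpha$ point of the standard normal distribution. Define the acceptance region
$$A(\theta_0) = \left\{x : \theta_0 - \frac{\tau_{\alpha/2}}{\sqrt{n}} \le \bar{x} \le \theta_0 + \frac{\tau_{\alpha/2}}{\sqrt{n}}\right\}$$
and
$$S(x) = \left\{\theta : \bar{x} - \frac{\tau_{\alpha/2}}{\sqrt{n}} \le \theta \le \bar{x} + \frac{\tau_{\alpha/2}}{\sqrt{n}}\right\}.$$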

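Example 2. Let $X = (X_1, \dots, X_n)$ with $X_i \sim N(\mu, \sigma^2)$, where both $\mu$ and $\sigma^2$ are unknown. The rejection region of a UMPU size $\alpha$ test for testing
$$H_0 : \mu = \mu_0 \quad\text{vs.}\quad H_1 : \mu \ne \mu_0$$
is $\sqrt{n}\,|\bar{x} - \mu_0| > s\,t_{n-1, \alpha/2}$, where $t_{n-1, \alpha}$ is the upper $\alpha$ point of the $t$-distribution with $(n - 1)$ d.f. and $s^2 = \dfrac{1}{n - 1}\sum_{i=1}^n (X_i - \bar{X})^2$. Define the acceptance region
$$A(\mu_0) = \left\{x : \mu_0 - \frac{s}{\sqrt{n}}\,t_{n-1, \alpha/2} \le \bar{x} \le \mu_0 + \frac{s}{\sqrt{n}}\,t_{n-1, \alpha/2}\right\}$$
and
$$S(x) = \left\{\mu : \bar{x} - \frac{s}{\sqrt{n}}\,t_{n-1, \alpha/2} \le \mu \le \bar{x} + \frac{s}{\sqrt{n}}\,t_{n-1, \alpha/2}\right\}.$$

Hence the UMAU confidence interval is given by
$$\left[\bar{x} - \frac{s}{\sqrt{n}}\,t_{n-1, \alpha/2},\ \bar{x} + \frac{s}{\sqrt{n}}\,t_{n-1, \alpha/2}\right].$$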


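Example 3. Let $X = (X_1, \dots, X_n)$ with $X_i \sim N(\theta, \sigma^2)$. The UMPU test for testing
$$H_0 : \sigma = \sigma_0 \quad\text{vs.}\quad H_1 : \sigma \ne \sigma_0$$
is given by
$$\varphi^0(v) = \begin{cases} 1 & \text{if } \dfrac{v}{\sigma_0^2} < c_1 \text{ or } \dfrac{v}{\sigma_0^2} > c_2 \\[2mm] 0 & \text{otherwise,} \end{cases}$$
where $v = \sum_{i=1}^n (x_i - \bar{x})^2$. Here $\dfrac{V}{\sigma_0^2} \sim \chi^2_{n-1}$ under $H_0$. Hence, the acceptance region of the UMPU test is
$$c_1 \le \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{\sigma_0^2} \le c_2,$$
where $c_1$ and $c_2$ are obtained from the following conditions,

(i) $E_{\sigma_0}\varphi^0(V) = \alpha$

(ii) $E_{\sigma_0}[V\varphi^0(V)] = \alpha\,E_{\sigma_0}(V)$.

By inverting the acceptance region, the UMAU confidence interval for $\sigma^2$ is given by
$$\left[\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{c_2},\ \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{c_1}\right].$$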


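Example 4. Let $X = (X_1, \dots, X_m)$ with $X_i \sim N(\theta, \sigma^2)$ and $Y = (Y_1, \dots, Y_n)$ with $Y_j \sim N(\mu, \sigma^2)$. Consider the difference of means $\delta = \theta - \mu$. The acceptance region of a UMPU test for testing
$$H_0 : \delta = \delta_0 \quad\text{vs.}\quad H_1 : \delta \ne \delta_0$$
is
$$A(\delta_0) = \left\{(x, y) : \delta_0 - s\sqrt{\frac{1}{m} + \frac{1}{n}}\;t_{m+n-2, \alpha/2} \le \bar{x} - \bar{y} \le \delta_0 + s\sqrt{\frac{1}{m} + \frac{1}{n}}\;t_{m+n-2, \alpha/2}\right\},$$
where $s^2 = \dfrac{\sum_{i=1}^m (x_i - \bar{x})^2 + \sum_{j=1}^n (y_j - \bar{y})^2}{m + n - 2}$, and
$$S(x, y) = \left\{\delta : \bar{x} - \bar{y} - s\sqrt{\frac{1}{m} + \frac{1}{n}}\;t_{m+n-2, \alpha/2} \le \delta \le \bar{x} - \bar{y} + s\sqrt{\frac{1}{m} + \frac{1}{n}}\;t_{m+n-2, \alpha/2}\right\}.$$

Hence the UMAU confidence interval is given by
$$\left[\bar{x} - \bar{y} - s\sqrt{\frac{1}{m} + \frac{1}{n}}\;t_{m+n-2, \alpha/2},\ \bar{x} - \bar{y} + s\sqrt{\frac{1}{m} + \frac{1}{n}}\;t_{m+n-2, \alpha/2}\right].$$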


Relation between shortest length confidence set and UMA confidence set:

If the measure of precision of a confidence interval is its expected length, one is natu- 
rally led to a consideration of unbiased confidence intervals. Pratt (1961) has shown 
that the expected length of a confidence interval is the average of false coverage prob- 


abilities. 


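Result 2. Let $\Theta$ be an interval on the real line and $p(x, \theta)$ be the p.d.f. of $X$. Let $S(X)$ be a family of $(1 - \alpha)$-level confidence intervals of finite length. If the length of a confidence set $S(x)$ is $\lambda(S(x)) =$ Lebesgue measure of the set $S(x)$, then
$$E_\theta[\lambda(S(X))] = \int_{\theta' \ne \theta} P_\theta(\theta' \in S(X))\,d\theta'.$$
Proof.
$$\lambda(S(x)) = \int_{\theta' \in S(x)} d\theta'.$$
Therefore,
$$\begin{aligned} E_\theta[\lambda(S(X))] &= \int_{\mathcal{X}} \lambda(S(x))\,p(x, \theta)\,dx \\ &= \int_{\mathcal{X}} \left[\int_{\theta' \in S(x)} d\theta'\right] p(x, \theta)\,dx \\ &= \int_\Theta \left[\int_{\{x :\, \theta' \in S(x)\}} p(x, \theta)\,dx\right] d\theta', \quad\text{by Fubini's theorem} \\ &= \int_\Theta P_\theta(\theta' \in S(X))\,d\theta' \\ &= \int_{\theta' \ne \theta} P_\theta(\theta' \in S(X))\,d\theta', \end{aligned}$$
the last step holding because the single point $\theta' = \theta$ has Lebesgue measure zero.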
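
Result 2 can be checked numerically for the two-sided interval $[\bar{X} \pm \tau_{\alpha/2}/\sqrt{n}]$ in the $N(\theta, 1)$ model, where the false coverage probability is available in closed form. The following is a minimal sketch using only the standard library; the function names, the integration range `half_width` and the number of steps are hypothetical choices for illustration.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()

def false_coverage(theta_prime, theta, n, alpha):
    """P_theta(theta' in S(X)) for S(X) = [Xbar - z/sqrt(n), Xbar + z/sqrt(n)], X_i ~ N(theta, 1)."""
    z = Z.inv_cdf(1 - alpha / 2)
    d = sqrt(n) * (theta_prime - theta)
    return Z.cdf(d + z) - Z.cdf(d - z)

def integrated_coverage(theta, n, alpha, half_width=20.0, steps=20000):
    """Trapezoidal approximation of the integral of P_theta(theta' in S(X)) over theta'."""
    a, b = theta - half_width, theta + half_width
    h = (b - a) / steps
    total = 0.5 * (false_coverage(a, theta, n, alpha) + false_coverage(b, theta, n, alpha))
    for i in range(1, steps):
        total += false_coverage(a + i * h, theta, n, alpha)
    return total * h

def expected_length(n, alpha):
    """Length of the interval, which is non-random in this model."""
    return 2 * Z.inv_cdf(1 - alpha / 2) / sqrt(n)

print(integrated_coverage(0.0, 9, 0.05), expected_length(9, 0.05))
```

In this model the length of the interval does not depend on the data, so its expectation is simply $2\tau_{\alpha/2}/\sqrt{n}$, and the two printed numbers should agree up to the error of the numerical integration.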


Remark. If $S(X)$ is a UMA confidence interval then it is also the uniformly shortest expected length (USEL) confidence interval. Similarly, a UMAU confidence interval is equivalent to the uniformly shortest unbiased expected length (USUEL) confidence interval.


Randomized confidence set 

The idea of inverting acceptance region to obtain the confidence set can not be directly 
used for discrete cases, since we have randomized test. In fact, in discrete problems 
inverting acceptance region of randomized tests may not lead to a confidence set with 
a given confidence coefficient. Note that randomization is used in hypothesis testing 
to obtain tests with a given size. Thus, the same idea can be applied to confidence 
sets, i.e., we may consider randomized confidence sets. Suppose a UMPU test for 
testing, 

$H_0 : \theta = \theta_0$ vs. $H_1 : \theta \ne \theta_0$,


is given by, 


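\[ \phi_{\theta_0}(x) = \begin{cases}
1 & \text{if } t_1 > c_2(t_2,\theta_0) \text{ or } t_1 < c_1(t_2,\theta_0) \\
\gamma_1(x) & \text{if } t_1 = c_1(t_2,\theta_0) \\
\gamma_2(x) & \text{if } t_1 = c_2(t_2,\theta_0) \\
0 & \text{if } c_1(t_2,\theta_0) < t_1 < c_2(t_2,\theta_0).
\end{cases} \]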


Let $U$ be a random variable that is independent of $X$ and has the uniform distribution
$U(0,1)$. Then the test $\tilde{\phi}_{\theta_0}(X,U) = I_{(U,1]}(\phi_{\theta_0}(X))$ has the same
power as $\phi_{\theta_0}$ and is non-randomized if $U$ is viewed as a part of the sample. Let

\[ A_U(\theta_0) = \{(x,U) : U \ge \phi_{\theta_0}(x)\} \]

be the acceptance region for $\tilde{\phi}_{\theta_0}(X,U)$. If $\phi_{\theta_0}(x)$ has size $\alpha$ for all
$\theta_0$, then inverting $A_U(\theta_0)$ we obtain a confidence set,

\[ C(X,U) = \{\theta : (X,U) \in A_U(\theta)\}. \]


Example 5. Let $X$ and $Y$ be independently distributed as Poisson $P(\lambda_1)$ and
$P(\lambda_2)$, and suppose we are interested in finding a confidence interval for
$\rho = \lambda_1/\lambda_2$. Define $T_1 = X$ and $T_2 = X + Y$. We know that
$T_1 \mid T_2 = t_2 \sim \mathrm{Bin}\left(t_2, \frac{\rho}{1+\rho}\right)$.
A UMPU test for testing

$H_0 : \rho = \rho_0$ vs. $H_1 : \rho \ne \rho_0$

is given by

\[ \phi_{\rho_0}(t_1,t_2) = \begin{cases}
1 & \text{if } t_1 > c_2(t_2,\rho_0) \text{ or } t_1 < c_1(t_2,\rho_0) \\
\gamma_1(t_2) & \text{if } t_1 = c_1(t_2,\rho_0) \\
\gamma_2(t_2) & \text{if } t_1 = c_2(t_2,\rho_0) \\
0 & \text{if } c_1(t_2,\rho_0) < t_1 < c_2(t_2,\rho_0).
\end{cases} \]

The acceptance region for the non-randomized test is

\[ A_U(\rho_0) = \{(t_1,t_2,U) : U \ge \phi_{\rho_0}(t_1,t_2)\}, \]

and the confidence set is

\[ C(T_1,T_2,U) = \{\rho : (T_1,T_2,U) \in A_U(\rho)\}. \]
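The conditional binomial distribution used in Example 5 can be verified directly from the Poisson probabilities. The following is a small sketch; the rates `lambda1`, `lambda2` and the total `t2` are made-up illustrative values, not quantities from the notes.

```r
# Check that X | X + Y = t2 is Binomial(t2, rho / (1 + rho)) for independent Poissons.
# lambda1, lambda2 and t2 are assumed values used only for illustration.
lambda1 <- 3
lambda2 <- 2
rho     <- lambda1 / lambda2
t2      <- 6                      # conditioning value of T2 = X + Y
t1      <- 0:t2                   # possible values of T1 = X

# Conditional pmf computed directly from the Poisson pmfs
# (X + Y is Poisson with rate lambda1 + lambda2).
cond_pmf <- dpois(t1, lambda1) * dpois(t2 - t1, lambda2) / dpois(t2, lambda1 + lambda2)

# Binomial pmf with success probability rho / (1 + rho).
binom_pmf <- dbinom(t1, size = t2, prob = rho / (1 + rho))

max_abs_diff <- max(abs(cond_pmf - binom_pmf))
print(max_abs_diff)
```

The conditional distribution depends on $(\lambda_1, \lambda_2)$ only through $\rho$, which is why the test for $\rho$ can be carried out conditionally on $T_2$.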


Nonparametric Inference: Module 1: 


What we provide in this module 


• A genesis of nonparametric inference

• Parametric Procedures-The traditional practice with limitations

• What is Nonparametric statistics today?
    - Nonparametric, the phrase
    - Nonparametric & Distribution free procedures

• Advantages, Disadvantages & Recommendation

• Software based learning

• An exploratory example of real field


1 Co-Ordinator: Dr. Rahul Bhattacharya, Department of Statistics, Calcutta University 


1 A genesis of Nonparametric Inference 


Although statistical methods have been used since ancient times, the discipline of
nonparametric statistics developed only in the early decades of the twentieth century.
Savage (1953) considered the year 1936 as the inception of the subject of nonparametric
statistics, marked by the publication of the work on rank correlation by Hotelling and
Pabst (1936). However, in a review, Scheffe (1943) pointed out that the sign test already
appears in Fisher's seminal book Statistical Methods for Research Workers (1925). Further
works bearing on the development of nonparametric statistics include contributions by
Friedman (1937), Kendall (1938), and Smirnov (1939). But the most significant contribution
in this context is due to Wilcoxon (1945), who developed rank based methods for comparing
unknown distributions. It was perhaps the earliest attempt to derive a statistical
procedure parallel to the two sample t test without any distributional assumption. This
particular work played the key role in accelerating the development of rank-based
statistical procedures in the 1950s and 1960s. In further works Pitman (1948), Hodges and
Lehmann (1956), and Chernoff and Savage (1958) investigated the efficiency of rank based
tests and found them promising relative to their parametric competitors. These works
popularized the use of nonparametric statistical procedures among research workers and
attracted practitioners in applied fields.

For a brief account of more recent developments in the field of nonparametric statistics,
we refer the interested reader to a special issue of the journal Statistical Science, which
covers a wide variety of topics. These include articles on comparing variances and other
dispersion measures by Boos and Brownie (2004), density estimation by Sheather (2004),
quantile-quantile (QQ) plots by Marden (2004), spatial statistics by Chang (2004),
reliability methods by Hollander and Pena (2004) and permutation tests by Ernst (2004),
among others.


2 Parametric Procedures-The traditional practice with limitations


Most of the traditional statistical tools are based on the parametric assumption: the data
at hand can be thought of as generated by some well-known distribution such as the normal,
exponential or Poisson. The parameter(s) of the distribution are assumed unknown, and in a
parametric inference problem we try to learn about the unknown parameter through
estimation, testing or confidence interval estimation. Most of the time the normal
distribution is used as the underlying population. The assumption of normality is often
justified by the Central Limit Theorem, which ensures closeness to normality of certain
statistics for large enough sample sizes. Other distributions are also important in
different fields of application. For example, lifetimes of physical systems are often
characterized by exponential, Weibull or Gamma distributions. In such cases the interest
lies in knowing the expected failure times. The lifetime distribution is also of interest
in certain medical trials, where the goal is to give an idea about the length of life after
certain treatments. Exponential, lognormal or Weibull distributions are often used to model
the underlying lifetime distribution so as to enable appropriate inferential procedures.
Again, in the analysis of economic data, Pareto or lognormal distributions are appropriate
for capturing the features of the income distribution.

However, in practice, most experiments are complex in nature and affected by a number of
factors, and hence the generated data might not be described by a well-known distribution.
Although for large data sets the normality assumption can be made for analysis, often the
amount of data is small and hence an appropriate distribution has to be assumed. In fact,
there are no exploratory methods to find the appropriate underlying distribution or to
differentiate among the available choices. Therefore, a traditional statistician has to
assume a distribution for the data which captures only certain features of the data,
without caring for the quality of the inference.


3 What is Nonparametric statistics today ? 


3.1 Nonparametric, the phrase 


The specific word Nonparametric has its root in the work of Jacob Wolfowitz (1942), where
he said

    We shall refer to this situation where a distribution is completely determined
    by the knowledge of its finite parameter set as the parametric case and denote the
    opposite case, where the functional forms of the distributions are unknown as the
    non-parametric case.

Therefore, parametric statistics is based on known distributions with unknown parameters
and nonparametric statistics was defined in the opposite way. Randles, Hettmansperger and
Casella (2004) echoed the same notion in the statement

    Nonparametric statistics can and should be broadly defined to include all
    methodology that does not use a model based on a single parametric family.

The limitations of traditional methods motivated the development of further statistical
techniques, which can be applied regardless of the true distribution of the data. These
techniques are the well known nonparametric and distribution-free methods.


3.2 Nonparametric & Distribution free procedures 


Although the terms nonparametric and distribution-free are often used interchangeably,
both actually indicate statistical procedures which are used in the absence of any
assumption about the underlying distribution. In the words of Bradley (1968)

    The terms nonparametric and distribution-free are not synonymous... Popular
    usage, however, has equated the terms ... Roughly speaking, a nonparametric
    test is one which makes no hypothesis about the value of a parameter in a
    statistical density function, whereas a distribution-free test is one which makes no
    assumptions about the precise form of the sampled population.

We start with the definition of these two important concepts and identify the differences
subsequently.


Distribution Free statistic: Let $X_i$, $i = 1,2,\ldots,n$ be $n$ variables with the joint
distribution $F$, where $F \in \mathcal{F}$, a class of joint distributions. Then a statistic
$T = T(X_1,\ldots,X_n)$ is said to be distribution free over $\mathcal{F}$ if the distribution
of $T$ remains the same for every possible $F \in \mathcal{F}$.

For example, let $\mathcal{F}$ be the class of joint distributions of $n$ iid normal variables
with unknown mean $\mu$ and variance unity. Then the statistic
$Z = \sum_{i=1}^{n}(X_i - \bar{X})^2$ has a $\chi^2$ distribution with $n-1$ degrees of freedom
and hence is distribution free over $\mathcal{F}$.


However, in the above example, the distribution free statistic is actually ancillary in
parametric terminology, that is, its distribution is independent of the indexing parameter
of the joint distribution. Such statistics are not nonparametric, since their
distributions vary for different joint distributions. For instance, the distribution of $Z$
above is no longer $\chi^2$ when the joint distribution is different from normal.


Nonparametric distribution Free statistic: A statistic $T = T(X_1,\ldots,X_n)$ is
nonparametric distribution free over $\mathcal{F}$ if the distribution of $T$ does not depend
on any $F \in \mathcal{F}$.

For example, if $F$ is the joint distribution of $n$ iid continuous random variables having
symmetry at the origin, then the statistic $Z = \sum_{i=1}^{n} I(X_i > 0)$ has a binomial
distribution with parameters $n$ and success probability $\theta_F = \Pr(X_1 > 0)$. Now due to
symmetry at the origin, $\theta_F = \frac{1}{2}$ and hence the distribution of $Z$ remains the
same (i.e. Binomial$(n, \frac{1}{2})$) whatever the distribution $F$ be. Thus $Z$ is
nonparametric distribution free.
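The distribution-free property of the sign statistic can be seen in a small simulation. The sketch below is illustrative only: the three symmetric parents (normal, Cauchy, uniform on $(-1,1)$), the sample size and the number of replications are assumptions chosen for the demonstration.

```r
# Simulated distribution of Z = sum(I(X_i > 0)) under three parents symmetric about 0,
# compared with the Binomial(n, 1/2) pmf. All settings are illustrative choices.
set.seed(1)
n    <- 10
reps <- 5000

sign_stat <- function(x) sum(x > 0)

z_normal  <- replicate(reps, sign_stat(rnorm(n)))
z_cauchy  <- replicate(reps, sign_stat(rcauchy(n)))
z_uniform <- replicate(reps, sign_stat(runif(n, -1, 1)))

# Relative frequency of each possible value 0, 1, ..., n.
rel_freq <- function(z) as.numeric(table(factor(z, levels = 0:n))) / reps

comparison <- rbind(
  binomial = dbinom(0:n, n, 0.5),
  normal   = rel_freq(z_normal),
  cauchy   = rel_freq(z_cauchy),
  uniform  = rel_freq(z_uniform)
)
colnames(comparison) <- 0:n
print(round(comparison, 3))
```

Up to simulation error, the three simulated rows agree with the binomial row, even though the parent distributions are very different.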


However, for the sake of simplicity, we use the term nonparametric to refer to statistical
procedures based on a nonparametric distribution free statistic.


4 Advantages, Disadvantages & Recommendation 


The main problem with parametric inference is that if the assumed distribution is not
correct, then all the effort might go in vain and the best (i.e. efficient) procedure can
become the "worst". Nonparametric procedures therefore enjoy the following advantages:

1. Nonparametric procedures are based on fewer assumptions about the underlying
   populations of the data. Therefore, these procedures can accommodate analysis of
   non-normal data.

2. These techniques are easy to understand and are often easier to apply. Thus, unlike
   parametric procedures, calculations with complex distributions can be avoided.

3. Although nonparametric procedures discard a portion of the information in the data,
   the loss in efficiency compared to competitors based on normal parents is only nominal.
   Moreover, such procedures tend to be more efficient when the underlying population
   deviates from normality.

4. Nonparametric methods are robust in the sense that they tolerate departures from
   assumptions. That is, if the distributional assumption holds, one can adopt the usual
   parametric procedures. But if the distributional assumption does not hold,
   nonparametric methods are the best alternative.
But nonparametric methods are not always the most desirable procedures, because:

1. Most nonparametric methods use only the ranks or signs of the observations,
   discarding the further features of the data. This makes such procedures less efficient.

2. Nonparametric methods are usually not as efficient as their parametric counterparts
   when the assumptions are met.


Although nonparametric methods require fewer assumptions, they are used rarely in
practice because they use only limited information contained in the data. Consequently,
nonparametric methods are primarily recommended in situations when assumptions are grossly
violated (e.g. when severe skewness in the data is present).


5 Software based learning 


With the advent of high-speed computing facilities and software like R, nonparametric
techniques are growing at a fast pace. We emphasise using R for learning purposes. R is
an open source statistical software, and users can obtain it free of charge through
the Comprehensive R Archive Network (CRAN). Most importantly, if the required statistical
methodology is not available within R, one can write code for one's own purpose.
Consequently, for data based problems, we not only give the final result but also provide
the R commands or codes together with the output.


6 An exploratory example of real field 
Consider the following problem:

    Suppose the average weekly sales of a new cell phone, collected for 12
    consecutive weeks from a particular shop, are as below:

    52 39 49 33 58 61 44 221 201 133 289 211

    The average weekly sales were 125 units for the last year. Is the evidence
    sufficient to conclude that sales this year exceed last year's sales?


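The usual statistical procedure in this context is a single sample t test. Specifically,
if $\mu$ denotes the true mean weekly sales, then we are interested in testing
$H_0 : \mu = 125$ against $H_1 : \mu > 125$. We run the test in R. Note that the vector
entered below contains 11 of the 12 listed values (201 is not included), which is why the
output reports 10 degrees of freedom.

> data = c(52, 39, 49, 33, 58, 61, 44, 221, 133, 289, 211)

> t.test(data, mu = 125, alternative = "greater")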


One Sample t-test 


data: data 
t = -0.6138, df = 10, p-value = 0.7235 


alternative hypothesis: true mean is greater than 125 


Thus we get the p value as 0.7235 and the value of the t statistic as -0.6138 with 10
degrees of freedom. The tabulated value of a t statistic with 10 degrees of freedom is 1.81
at the 5% level of significance. Since the observed value does not exceed the tabulated
value, we do not reject the null hypothesis at the 5% significance level. Consequently, the
evidence is not sufficient to conclude that there is an increase in the true mean weekly
sales.
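The reported statistic can be reproduced from the definition of the one sample t statistic, $t = (\bar{x} - \mu_0)/(s/\sqrt{n})$. The sketch below uses the 11 values that were entered in the R command above; the variable names are our own.

```r
# One sample t statistic computed from its definition, using the 11 values
# that were passed to t.test() in the session above.
sales <- c(52, 39, 49, 33, 58, 61, 44, 221, 133, 289, 211)
mu0   <- 125
n     <- length(sales)

t_stat  <- (mean(sales) - mu0) / (sd(sales) / sqrt(n))   # observed t
p_value <- pt(t_stat, df = n - 1, lower.tail = FALSE)     # P(T > t) for "greater"
t_crit  <- qt(0.95, df = n - 1)                           # upper 5% point of t with n - 1 df

print(round(c(t = t_stat, p.value = p_value, critical = t_crit), 4))
```

The null hypothesis would be rejected at the 5% level only if `t_stat` exceeded `t_crit`, which is not the case here.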
But these results are not final because we have not yet checked the assumptions required
to perform a t test. Specifically, results from t tests are valid provided
1. Observations are drawn from a normal parent population, or
2. The sample size is sufficiently large (say, at least 30).

In our example, we have only 11 observations and hence the t test is valid only if we can
show that the underlying distribution is normal.
The simplest descriptive method to check the shape of the data is to plot the histogram. If
the resulting histogram is symmetric and bell shaped, the underlying distribution is taken
as normal. For the above data we plot the histogram below:


Figure 1: The histogram of data (panel title: Histogram of sales; horizontal axis: sales;
vertical axis: Frequency)


Clearly, the histogram of the sales data is far from being symmetric. This indicates the
non-normal nature of the underlying distribution of the sales data.
However, for confirmation, we further use a normal Q-Q plot. The normal Q-Q plot is a simple
and efficient exploratory technique, where the quantiles of the observed standardised data
set are plotted against the corresponding quantiles of a standard normal distribution. If
the underlying distribution is normal, the standardised data behave like a standard normal
sample and the points should fall along a straight line. The normal Q-Q plot for the
standardised version of the sales data is given below.

We find from the plot that the observed quantiles are too far from those of a standard
normal distribution and hence the assumption of normality is not reasonable. Thus the t test
is not trustworthy for the sales data set.
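Both diagnostic plots can be produced with base R graphics. The commands below are a sketch of one way to do it, using the same 11 values that were analysed with the t test; they are not necessarily the commands used to draw the figures in these notes.

```r
# Histogram and normal Q-Q plot of the sales data (same 11 values as in the t test).
sales <- c(52, 39, 49, 33, 58, 61, 44, 221, 133, 289, 211)

# Histogram of the raw data.
hist(sales, main = "Histogram of sales", xlab = "sales")

# Standardise: subtract the sample mean and divide by the sample standard deviation.
z <- as.numeric(scale(sales))

# Normal Q-Q plot of the standardised data, with a reference line through the quartiles.
qq <- qqnorm(z, main = "Normal Q-Q Plot",
             xlab = "Normal Quantiles", ylab = "Sample Quantiles")
qqline(z)
```

`qqnorm()` also returns the plotted coordinates invisibly (stored here in `qq`), which can be inspected if a numerical comparison of the quantiles is wanted.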

This particular example exhibits the limitations of a parametric procedure in small
samples, especially when the underlying distributional assumption is not satisfied. The
above data do not satisfy the assumptions required to perform a t test. As a result, the
application of the t test here is forced, and hence the results from such a test are
inconclusive even with a high p value.


Figure 2: Normal Q-Q plot (horizontal axis: Normal Quantiles; vertical axis: Sample
Quantiles)


Thus there are situations in which parametric assumptions are violated and alternative
(i.e. nonparametric) methods of analysis become the most appropriate and meaningful.




Nonparametric Inference: Module 2: 


What we provide in this module 


• Different inferential set up

• Functionals

• Plug-in estimators

• Examples




1 A short recap 


So far we have learnt that data can be non-normal in real practice. Traditional parametric
methods then fail to describe the features of the data set. If the normality assumption is
forced, the results can be severely in error, especially in small samples. Then an
alternative inferential method is adopted which, unlike parametric methods, does not
require any particular distributional assumption. These methods are called nonparametric
methods.


2 The set up 


Here the data at hand is $x = (x_1, x_2, \ldots, x_n)$, where $x_i \in (-\infty,\infty)$ is the
observed value of a random variable $X_i$, $i = 1,2,\ldots,n$.
$F(x) = P(X_1 \le x_1, \ldots, X_n \le x_n)$ is the distribution function (DF) from which $x$
is generated. The components of $X$ are not in general independent, but we assume
independence, i.e., $F(x) = \prod_{i=1}^{n} F_i(x_i)$ for marginal DFs $F_i$, $i = 1,\ldots,n$.
The pair $(X, F)$ generates the induced probability space $(\mathcal{X}, \mathcal{B}, P)$, where
$\mathcal{X}$ is the sample space, $\mathcal{B}$ is the $\sigma$-field generated by all possible
subsets of $\mathcal{X}$, and $P$ is the induced probability measure. Suppose $p(x)$ is the
generalised density of $X$. In any situation, $p(x)$ is either partially or completely
unknown. In statistical inference, we try to give some idea about $p(x)$ or some suitable
function or functional of it.


3 Different inferential set up 


Depending on the knowledge of $p(x)$ (i.e. whether completely or partly unknown),
inferential set ups are either (i) Parametric, (ii) Nonparametric or (iii) Semiparametric.


4 Classification 


Suppose $p(x)$ is partially known. We can write $p(x) = p(x,\theta)$, $\theta \in \Omega$, where
$p$ is functionally known. $\theta$ (numerical or abstract valued) is an unknown quantity
indexing $p$, often called a labelling parameter. $\Omega$ is the space of all possible values
of $\theta$. It is called the parameter space.


Example of parametric set up 


1 If $\theta$ is real or vector valued, the set up is parametric. For example, suppose $X_i$
  are iid observations from a Bernoulli distribution with mean $\pi \in (0,1)$. Then
  $\theta = \pi$ is unknown and real valued.
  $p(x,\theta) = \prod_{i=1}^{n} \theta^{x_i}(1-\theta)^{1-x_i}$ is a known functional form and
  $\theta \in (0,1)$ is real valued. Thus the set up is parametric.

2 Suppose $X_i$ are iid observations from a $N(\mu,\sigma^2)$ distribution, $\mu \in R$ and
  $\sigma > 0$. Then $\theta = (\mu,\sigma)$ is unknown and vector valued. Here
  $p(x,\theta) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$
  is a known functional form and $\theta \in (-\infty,\infty)\times(0,\infty) \subset R^2$. Thus
  the set up is parametric.


Example of semi parametric set up 


1 If $\theta$ is partly real or vector valued and partly abstract valued, the set up is
  semiparametric. For example, suppose $X_i$ are iid observations from a density
  $f(x) = \frac{1}{2}\,\frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2}\right) + g(x)$,
  where $g$ is unknown but known to be continuous. Then
  $p(x,\theta) = \prod_{i=1}^{n} f(x_i)$ is a partly known functional form and
  $\theta = (\mu, g)$ is partly abstract valued and partly real valued. Thus the set up is
  semiparametric.

2 Consider the regression set up with one non-stochastic covariate. Suppose $X_i$ are
  independent observations from the continuous density $f_i$, $i = 1,2,\ldots,n$, with
  $E(X_i) = \alpha + \beta z_i$ and $Var(X_i) = \sigma^2$, where $z_i$ are known quantities,
  $i = 1,2,\ldots,n$. Then $\theta = (\alpha,\beta,\sigma,f_1,\ldots,f_n)$ is partly abstract
  valued and partly real valued and hence the set up is semiparametric.


Example of non parametric set up 


1 If $\theta$ is completely abstract valued, the set up is nonparametric.
  Suppose $X_i$, $i = 1,2,\ldots,n$ are iid observations from an unknown DF $F(x)$, where the
  functional form of $F$ is unknown but known to be continuous. Then $\theta = F$ is unknown
  and abstract valued. $\Omega$ is the class of all absolutely continuous DFs. Thus the set
  up is nonparametric.

2 Consider two independent samples $\{X_i, i = 1,\ldots,m\}$ and $\{Y_j, j = 1,\ldots,n\}$,
  where $X_i$ are iid observations from an unknown density $f(x)$ and $Y_j$ are iid
  observations from an unknown density $g(x)$, where $f$ and $g$ are both unknown but known
  to be continuous. Then $\theta = (f, g)$ is unknown and abstract valued. Thus the set up is
  nonparametric.


5 Concept of functional 


In a nonparametric set up, the basic assumption is the continuity of the underlying
distribution $F$. The unknown quantity of interest (a parametric function in the parametric
set up) is defined in terms of a functional. If $\mathcal{F}$ is the class of all absolutely
continuous distributions, then a functional $\theta_F = \theta(F)$ is a real valued function
defined for every $F \in \mathcal{F}$. Thus $\theta : \mathcal{F} \to R$ is a mapping from the
abstract space to the real line. In nonparametric inference, the objective is to learn the
value of $\theta_F = \theta(F)$ for a known functional $\theta$ based on iid observations from an
unknown DF $F$.


Example of functionals 


Suppose $X_i$, $i = 1,2,\ldots,n$ are iid observations from an unknown DF $F(x)$, where $F$ is
unknown but known to be continuous. A few possible functionals are:

1. $\theta_F = E_F I(X \le a)$, for some known $a$, where $\mathcal{F}$ is the class of all
   absolutely continuous DFs.

2. $\theta_F = E_F(X)$, where $\mathcal{F} = \{F : E_F(|X|) < \infty\}$

3. $\theta_F = E_F(X^2)$, where $\mathcal{F} = \{F : E_F(X^2) < \infty\}$


Linear functional 


Each of the functionals can be expressed as $E_F\psi(X)$, for some $\psi$.

For examples 2 and 3, $\psi(X) = X$ and $X^2$.

For example 1, $\psi(X) = I(X \le a)$, where $I(\,)$ is the indicator function. Then
$\theta(\alpha F_1 + (1-\alpha)F_2) = \alpha\theta_{F_1} + (1-\alpha)\theta_{F_2}$ for any
$0 < \alpha < 1$. The above functionals are called linear functionals.


Nonlinear functionals 


Suppose $X_i$, $i = 1,2,\ldots,n$ are iid observations from an unknown DF $F(x)$, where $F$ is
unknown but known to be continuous. Then the following are also valid functionals:

1. $\theta_F = \{E_F(X)\}^2$, where $\mathcal{F} = \{F : E_F(|X|) < \infty\}$

2. $\theta_F = Var_F(X)$, where $\mathcal{F} = \{F : E_F(X^2) < \infty\}$

The above functionals are nonlinear functionals as
$\theta(\alpha F_1 + (1-\alpha)F_2) \ne \alpha\theta_{F_1} + (1-\alpha)\theta_{F_2}$ for
$0 < \alpha < 1$.


Further examples 


Suppose $X_i$, $i = 1,2,\ldots,n$ are iid observations from an unknown DF $F(x)$, where $F$ is
unknown but known to be continuous. Then the following are also nonlinear functionals:
1. $\theta_F = F^{-1}(\frac{1}{2})$, the median, where $\mathcal{F}$ is the class of all absolutely
   continuous DFs.

2. $\theta_F = \frac{F^{-1}(3/4) - F^{-1}(1/4)}{2}$, the quartile deviation, where $\mathcal{F}$ is
   the class of all absolutely continuous DFs.


A critical example 


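Suppose $\mathcal{F}$ is the class of all absolutely continuous DFs with finite expectation.
Then the question is: what will be the type of the functional $\theta_F = (E_F(X))^2$?

Take $F_i \in \mathcal{F}$, $i = 1,2$. Then for any $\alpha \in (0,1)$,

\[ E_{\alpha F_1 + (1-\alpha)F_2}(X) = \alpha E_{F_1}(X) + (1-\alpha) E_{F_2}(X). \]

But, in general,

\[ \{E_{\alpha F_1 + (1-\alpha)F_2}(X)\}^2 \ne \alpha \{E_{F_1}(X)\}^2 + (1-\alpha)\{E_{F_2}(X)\}^2. \]

Therefore, $\theta_F$ is not a linear functional.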
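A small numerical illustration of this failure of linearity is sketched below. The means `m1` and `m2` of $F_1$ and $F_2$ and the mixing weight `alpha` are assumed values chosen only for the example.

```r
# Linearity check for theta_F = (E_F X)^2 under a mixture alpha * F1 + (1 - alpha) * F2.
# m1, m2 (the means of F1 and F2) and alpha are assumed illustrative values.
alpha <- 0.4
m1    <- 1
m2    <- 3

# The mean itself is a linear functional, so the mean of the mixture is:
mean_mix <- alpha * m1 + (1 - alpha) * m2

lhs <- mean_mix^2                              # theta(alpha F1 + (1 - alpha) F2)
rhs <- alpha * m1^2 + (1 - alpha) * m2^2       # alpha theta(F1) + (1 - alpha) theta(F2)

# The two sides differ by alpha * (1 - alpha) * (m1 - m2)^2.
gap <- rhs - lhs
print(c(lhs = lhs, rhs = rhs, gap = gap))
```

The gap vanishes only when the two means coincide, which is why the inequality holds in general rather than for every pair $F_1, F_2$.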


Plug in estimator: Analogue to statistic 


Suppose $X_i$, $i = 1,2,\ldots,n$ are iid observations from an unknown DF $F(x)$, where $F$ is
unknown but known to be continuous. $F_n$ is the empirical DF based on the data.
$\theta(F_n)$ is the plug-in estimator corresponding to the functional $\theta(F)$.

1. $E_{F_n}\psi(X) = \frac{1}{n}\sum_{i=1}^{n}\psi(X_i)$ is the plug-in estimator corresponding to
   the functional $E_F\psi(X)$.

2. $F_n^{-1}(\frac{1}{2})$, the sample median, is the plug-in estimator corresponding to
   $F^{-1}(\frac{1}{2})$.
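Plug-in estimators are easy to compute in R through the empirical DF. The sketch below uses a small made-up data vector and a made-up cut-off `a`; `quantile()` with `type = 1` is used because that type is the inverse of the empirical DF.

```r
# Plug-in estimators theta(F_n) for three functionals; x and a are made-up values.
x  <- c(2.1, -0.4, 1.3, 0.7, 3.5, -1.2)
Fn <- ecdf(x)                       # empirical DF F_n
a  <- 1

theta_hat_cdf    <- Fn(a)           # plug-in for E_F I(X <= a); same as mean(x <= a)
theta_hat_mean   <- mean(x)         # plug-in for E_F(X)
theta_hat_median <- unname(quantile(x, probs = 0.5, type = 1))  # F_n^{-1}(1/2)

print(c(cdf = theta_hat_cdf, mean = theta_hat_mean, median = theta_hat_median))
```

With an even number of observations, $F_n^{-1}(\frac{1}{2})$ is the lower of the two middle order statistics, so it can differ from the value returned by `median(x)`, which averages them.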


A misleading example

Suppose $\mathcal{F}$ is the class of all absolutely continuous DFs with symmetry at the
origin. Then what will be the type of the functional $\theta_F = P(X > 0)$?
At first look it appears to be a linear functional. But symmetry at the origin gives
$\theta_F = \frac{1}{2}$ for every $F \in \mathcal{F}$.

Therefore, on this class $\theta_F$ is a known constant and not a functional.


Nonparametric Inference: Module 3: 


What we provide in this module 


• Estimability

• Kernels and Symmetry

• Sum and product of kernels

• U statistic

• Few examples




1 A short recap 


So far we have learnt the following
• Functionals are the analogues of the parameter of interest.
• Statistical functionals are the analogues of a statistic.
• Functionals are linear or nonlinear.
• Functionals and statistical functionals form the basis for nonparametric inference.


2 Inferential problems in nonparametrics 


As in parametric inference, we have three different types of inferential problems:
i. Estimation
ii. Hypothesis Testing
iii. Confidence Interval estimation

In this module we start with nonparametric estimation.


3 Estimability 


We have already introduced a functional and a statistical functional. Suppose
$X_1, X_2, \ldots, X_n$ are iid observations from $F$ and $\theta_F$ is a functional defined on
$\mathcal{F}$, the class of all absolutely continuous DFs.

Then $\theta_F$ is said to be estimable if there exists a $\phi(X_1, X_2, \ldots, X_k)$,
$k \le n$, such that $E_F\{\phi(X_1, X_2, \ldots, X_k)\} = \theta_F$ for all $F \in \mathcal{F}$.

If the above holds for some $\phi$ and $k$, $\theta_F$ is called a regular functional.

$\phi$ is called a kernel for estimation of $\theta_F$. The minimum value of $k$ ensuring the
above is called the degree of the kernel.


4 Sum / Difference of the kernels 


If $\phi_i$ is a kernel of degree $k_i$ for $\theta_i(F)$, $i = 1,2$, then we are interested to
know the degree of $\phi_1 \pm \phi_2$. Observe that $\phi_1 \pm \phi_2$ is a kernel for
$\theta_1(F) \pm \theta_2(F)$. A few observations can be common between $\phi_1$ and $\phi_2$.

Assume that $k_1 \le k_2$. Then $\phi_1 \pm \phi_2$ is based on at most $k_2$ observations. If
$k_1 > k_2$, then $\phi_1 \pm \phi_2$ is based on at most $k_1$ observations. In any case
$\phi_1 \pm \phi_2$ involves at most $\max(k_1, k_2)$ observations. So, the degree of
$\phi_1 \pm \phi_2$ is at most $\max(k_1, k_2)$.

Generalization: The degree of $\sum_{i=1}^{s} \phi_i$ cannot exceed $\max(k_1, k_2, \ldots, k_s)$.


5 Product of kernels 


If $\phi_i$ is a kernel of degree $k_i$ for $\theta_i(F)$, $i = 1,2$, then we are interested to
know the degree of $\phi_1\phi_2$. Suppose there are some common observations in $\phi_1$ and
$\phi_2$. Then $\phi_1$ and $\phi_2$ are, in general, dependent. Thus their product is not a
kernel for $\theta_1(F)\theta_2(F)$. Suppose $\theta_1(F)\theta_2(F)$ is the functional of interest.
Then $\phi_1\phi_2$ is a kernel for $\theta_1(F)\theta_2(F)$ if they are based on different sets of
observations. Assume that the sets of observations corresponding to $k_1$ and $k_2$ are
different. Then $\phi_1\phi_2$ involves $k_1 + k_2$ distinct observations. The degree of
$\phi_1\phi_2$ is $k_1 + k_2$.

Generalization: The degree of the kernel $\prod_{i=1}^{s} \phi_i$ is $\sum_{i=1}^{s} k_i$ for the
functional $\prod_{i=1}^{s} \theta_i(F)$.


6 Symmetric kernels 


A kernel $\phi(x_1, x_2, \ldots, x_k)$ is called symmetric if
$\phi(x_1, x_2, \ldots, x_k) = \phi(x_{i_1}, x_{i_2}, \ldots, x_{i_k})$ for any permutation
$(i_1, i_2, \ldots, i_k)$ of $\{1, 2, \ldots, k\}$.

1 $\sum_{i=1}^{k} x_i$ or any function of $\sum_{i=1}^{k} x_i$ are symmetric kernels.

2 Median, quantiles or any function of them do not depend on the order of the data
  and hence are symmetric kernels.


3 $I(x_1 + x_2 > 1)$ is a symmetric kernel.


7 Asymmetric kernels 


A kernel $\phi(x_1, x_2, \ldots, x_k)$ is asymmetric if it is not permutation invariant, i.e.
$\phi(x_{i_1}, x_{i_2}, \ldots, x_{i_k})$ and $\phi(x_1, x_2, \ldots, x_k)$ are not the same for
some permutation $(i_1, i_2, \ldots, i_k)$ of $\{1, 2, \ldots, k\}$.

1 $\phi(x_1, x_2) = x_1^2 - x_1x_2$ is an asymmetric kernel as $\phi(x_1, x_2) \ne \phi(x_2, x_1)$.
2 $x_1 - x_2$ and $\frac{x_1}{x_2}$ are further examples of asymmetric kernels.

3 As another example, $\phi(x_1, x_2, x_3) = x_1^2 + x_2^2 + x_3^2 - x_1x_3$ is asymmetric as
  $\phi(x_1, x_2, x_3) = \phi(x_3, x_2, x_1)$ but $\phi(x_1, x_2, x_3) \ne \phi(x_3, x_1, x_2)$.


A useful result 


Result: For every regular functional, there exists a symmetric kernel. 

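Proof: Suppose $X_1, \ldots, X_n$ are iid observations from $F$ and $\theta_F$ is a regular
functional. If $\phi(x_1, x_2, \ldots, x_k)$ is an asymmetric kernel then
$E\phi(X_1, X_2, \ldots, X_k) = \theta_F$. As the observations are iid,

\[ E\phi(X_{i_1}, X_{i_2}, \ldots, X_{i_k}) = \theta_F \]

for any permutation $(i_1, i_2, \ldots, i_k)$ of $\{1, 2, \ldots, k\}$. We construct the symmetric
function

\[ \phi_s(x_1, x_2, \ldots, x_k) = \frac{1}{k!} \sum_{(i_1, \ldots, i_k)} \phi(x_{i_1}, x_{i_2}, \ldots, x_{i_k}), \]

where the sum runs over all $k!$ permutations. Then $E\phi_s(X_1, X_2, \ldots, X_k) = \theta_F$.

Without any loss of generality, kernels can be taken as symmetric.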


8 U statistic 


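Let $X_1, X_2, \ldots, X_n$ be iid observations from $F$. For any kernel $\phi$ of degree $k$,
the U statistic is defined as

\[ U_n = \frac{(n-k)!}{n!} \sum_{1 \le i_1 \ne i_2 \ne \cdots \ne i_k \le n} \phi(X_{i_1}, X_{i_2}, \ldots, X_{i_k}). \]

If $\phi$ is symmetric, $U_n$ takes the form

\[ U_n = \frac{1}{\binom{n}{k}} \sum_{1 \le i_1 < i_2 < \cdots < i_k \le n} \phi(X_{i_1}, X_{i_2}, \ldots, X_{i_k}). \]

We note that
• $U_n$ is permutation invariant.

• $U_n$ is based on all observations.

• For further discussion, we consider $\phi$ as symmetric.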
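For a symmetric kernel, the U statistic is simply the average of the kernel over all $\binom{n}{k}$ subsets of the data, which can be sketched in a few lines of R. The function name `u_statistic`, the data vector and the three kernels below are illustrative choices of ours, not notation from these notes.

```r
# U statistic for a symmetric kernel of degree k: average of the kernel over all
# choose(n, k) subsets of the observations. Function name and data are illustrative.
u_statistic <- function(x, kernel, k) {
  values <- combn(length(x), k, function(idx) kernel(x[idx]))
  mean(values)
}

x <- c(2.1, -0.4, 1.3, 0.7, 3.5, -1.2)   # made-up data

u_mean <- u_statistic(x, function(z) z[1], k = 1)                  # kernel x1
u_var  <- u_statistic(x, function(z) (z[1] - z[2])^2 / 2, k = 2)   # kernel (x1 - x2)^2 / 2
u_sq   <- u_statistic(x, function(z) z[1] * z[2], k = 2)           # kernel x1 * x2

print(c(u_mean = u_mean, u_var = u_var, u_sq = u_sq))
print(c(mean = mean(x), var = var(x), mean_sq_adj = mean(x)^2 - var(x) / length(x)))
```

The two printed rows agree: the three U statistics reduce to the sample mean, the sample variance with divisor $n-1$, and $\bar{X}^2 - s^2/n$ respectively. Enumerating all subsets costs $\binom{n}{k}$ kernel evaluations, so this direct approach is practical only for small $n$ and $k$.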


Example 1 


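We consider the set up $\mathcal{F} = \{F : E_F|X| < \infty\}$, $\theta_F = E(X_1)$.
Take $\phi(X_1) = X_1$; then $E\phi = \theta_F$ and $\phi$ is a symmetric kernel of degree $k = 1$.

Corresponding U statistic is:

\[ U_n = \frac{1}{n} \sum_{i=1}^{n} X_i = \bar{X}. \]

Thus we get the usual unbiased estimator.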


Example 2 


We consider the set up $\mathcal{F}$ = class of all absolutely continuous DFs and
$\theta_F = P(X_1 \le a)$, where $a$ is known. We observe that $\theta_F = E\,I(X_1 \le a)$. This
suggests taking $\phi(X_1) = I(X_1 \le a)$ as a kernel; $\phi$ is a symmetric kernel of degree
$k = 1$. Then the corresponding U statistic is:

\[ U_n = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le a). \]

We get the usual unbiased estimator, the proportion of observations not exceeding $a$.




Example 3 


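We consider $\mathcal{F} = \{F : E_F|X| < \infty\}$ and $\theta_F = (E(X_1))^2$. Since
$E(X_1X_2) = E(X_1)E(X_2) = (E(X_1))^2$, we take $\phi(X_1, X_2) = X_1X_2$; then
$E\phi = \theta_F$. The corresponding U statistic is

\[ U_n = \frac{1}{\binom{n}{2}} \sum_{1 \le i_1 < i_2 \le n} X_{i_1}X_{i_2} = \bar{X}_n^2 - \frac{s^2}{n}, \]

with $s^2$ as the sample variance with divisor $n - 1$.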


Example 4 


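We consider the set up $\mathcal{F} = \{F : E_FX^2 < \infty\}$, $\theta_F = Var(X_1)$. Observe
that $Var(X_1) = E(X_1^2) - (E(X_1))^2$. Thus $\theta_F$ is a difference of two quantities. For
the first quantity, a kernel is $X_1^2$ and a kernel for the second quantity is $X_1X_2$.

We take the difference of these two as a kernel. That is, we take
$\phi_0(X_1, X_2) = X_1^2 - X_1X_2$ as a kernel. But $\phi_0$ is an asymmetric kernel of degree
$k = 2$. The corresponding symmetric kernel is

\[ \phi(X_1, X_2) = \frac{1}{2}\{\phi_0(X_1, X_2) + \phi_0(X_2, X_1)\} = \frac{1}{2}(X_1 - X_2)^2. \]

Corresponding U statistic is

\[ U_n = \frac{1}{\binom{n}{2}} \sum_{1 \le i_1 < i_2 \le n} \phi(X_{i_1}, X_{i_2}). \]

We can simplify $U_n$ as

\[ U_n = \frac{1}{\binom{n}{2}} \sum_{1 \le i_1 < i_2 \le n} \frac{1}{2}(X_{i_1} - X_{i_2})^2
       = \frac{1}{2n(n-1)} \sum_{i_1=1}^{n} \sum_{i_2=1}^{n} (X_{i_1} - X_{i_2})^2 \]

\[ = \frac{1}{2n(n-1)} \left[ n\sum_{i_1=1}^{n} X_{i_1}^2 + n\sum_{i_2=1}^{n} X_{i_2}^2 - 2\sum_{i_1=1}^{n}\sum_{i_2=1}^{n} X_{i_1}X_{i_2} \right] \]

\[ = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 = s^2. \]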


Example 5 


We consider the set up $\mathcal{F} = \{F : E_FX^2 < \infty\}$, $\theta_F = E(X_1^2)$.
Observe that $\theta_F = Var(X_1) + \{E(X_1)\}^2$. For the first component, a kernel is
$X_1^2 - X_1X_2$ with degree 2, and for the second component, a kernel is $X_1X_2$ with
degree 2. For $\theta_F$, the kernel is the sum $X_1^2$, which has degree 1.

Corresponding U statistic is

\[ U_n = \frac{1}{n} \sum_{i=1}^{n} X_i^2. \]

We at once observe the following:

The kernel is a sum of two kernels, each of degree 2. But the resultant kernel is of degree
1. Thus the degree of the sum of two kernels can be strictly less than the maximum of their
degrees, that is, the inequality for the degree of a sum of kernels can be strict. Again,
one of the kernels is symmetric and the other asymmetric, but the resultant kernel is
symmetric. In particular, the sum of two symmetric kernels is always symmetric.


Example 6 


We consider the set up $\mathcal{F}$ = class of all bivariate absolutely continuous DFs.
$(X_i, Y_i)$, $i = 1,2,\ldots,n$ are iid observations from $F \in \mathcal{F}$ and
$\theta_F = E_F(X_1Y_1)$.

Define a new variable $Z_i = X_iY_i$, $i = 1,2,\ldots,n$. Then $Z_i$ are iid observations from
some univariate absolutely continuous DF $G$. Thus $\theta_F$ reduces to $\theta_G = E_G(Z_1)$.
Then the corresponding U statistic is $U_n = \frac{1}{n}\sum_{i=1}^{n} Z_i$.


Example 7 


We consider the set up $\mathcal{F}$ = class of all bivariate absolutely continuous DFs.
$(X_i, Y_i)$, $i = 1,2,\ldots,n$ are iid observations from $F \in \mathcal{F}$ and
$\theta_F = Var(X + Y)$.

Define a new variable $Z_i = X_i + Y_i$, $i = 1,2,\ldots,n$. Then $Z_i$ are iid observations
from some univariate absolutely continuous DF $G$. Thus $\theta_F$ reduces to
$\theta_G = Var_G(Z_1)$. Hence the corresponding U statistic is
$U_n = \frac{1}{n-1}\left[\sum_{i=1}^{n} Z_i^2 - n\bar{Z}^2\right]$.


Nonparametric Inference: Module 4: 


What we provide in this module 


• Sufficiency of order statistics

• Improvement using U statistic

• U is MVUE

• Exact variance of U

• Bounds for variance

• Few examples




1 A few concepts 


Suppose $X = (X_1, X_2, \ldots, X_n)$ are iid observations from an absolutely continuous
distribution $F$. $\mathcal{F}$ is the class of all absolutely continuous DFs.
$Y = (X_{(1)}, X_{(2)}, \ldots, X_{(n)})$ is the full set of order statistics corresponding to
$X$. $P_n$ is the set of $n!$ permutations of the first $n$ natural numbers. The joint pdf of
$(X_1, X_2, \ldots, X_n)$ given the full set of order statistics is

\[ f(x \mid y) = \frac{1}{n!}, \quad x = (y_{i_1}, \ldots, y_{i_n}),\; (i_1, \ldots, i_n) \in P_n. \]

This conditional distribution does not depend on $F$. Thus $Y$ is sufficient for $F$.


2 U statistics as conditional expectation 


Suppose $\theta_F$ is a functional defined on $\mathcal{F}$. Also suppose $\phi(X_1, X_2, ..., X_k)$, $k \le n$, is a (symmetric) kernel for $\theta_F$. That is, $\phi(X_1, X_2, ..., X_k)$ is an unbiased estimator of $\theta_F$. Then

$E[\phi(X_1, X_2, .., X_k) | Y] = \binom{n}{k}^{-1} \sum_{1 \le i_1 < i_2 < .. < i_k \le n} \phi(X_{i_1}, X_{i_2}, .., X_{i_k})$


The RHS is nothing but the conventional U statistic with kernel $\phi$ of degree k.
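The averaging over all k-subsets can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `u_statistic`, the kernel `half_sq_diff` and the data are hypothetical and not part of the notes. It shows that the kernel $\frac{1}{2}(x_1 - x_2)^2$ of degree 2 reproduces the usual sample variance, while the identity kernel of degree 1 reproduces the sample mean.

```python
from itertools import combinations
from math import comb
from statistics import variance


def u_statistic(sample, kernel, k):
    """Average a symmetric kernel of degree k over all k-subsets of the sample."""
    total = sum(kernel(*subset) for subset in combinations(sample, k))
    return total / comb(len(sample), k)


def half_sq_diff(x1, x2):
    """Kernel of degree 2 that is unbiased for the variance."""
    return 0.5 * (x1 - x2) ** 2


# Hypothetical data, chosen only for illustration.
data = [2.0, 4.0, 4.0, 5.0, 7.0]
u_var = u_statistic(data, half_sq_diff, 2)    # U statistic for the variance
u_mean = u_statistic(data, lambda x: x, 1)    # U statistic for the mean
print(u_var, variance(data), u_mean)
```

Enumerating all subsets costs $\binom{n}{k}$ kernel evaluations, so this direct form is practical only for small n and k.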


3 An improved estimator from an Unbiased estimator 


Let $T_n(X_1, X_2, ..., X_n)$ be an UE of $\theta_F$. The corresponding U statistic is

$U_n = \frac{1}{n!} \sum_{(i_1, i_2, ..., i_n) \in P_n} T_n(X_{i_1}, X_{i_2}, ..., X_{i_n}) = E(T_n(X_1, X_2, ..., X_n) | Y)$.


Then $E(U_n^2) = E\{E(T_n|Y)\}^2 \le E\{E(T_n^2|Y)\} = E T_n^2$. Since $E(U_n) = E(T_n) = \theta_F$,

$Var(U_n) \le Var(T_n)$.


Thus $U_n$ is an improvement over the UE $T_n$.


4 U statistic & MVUE 


Suppose $\mathcal{F}$ is the class of all absolutely continuous DFs. Then $Y = (X_{(1)}, X_{(2)}, .., X_{(n)})$ is complete sufficient for $\mathcal{F}$ (see Fraser, 1957). $U_n$ is symmetric in the observations and hence is a function of the full set of order statistics. Thus $U_n$ is unbiased and a function of the complete sufficient statistic. Therefore $U_n$ is MVUE for $\theta_F$.


5 Exact variance of U statistic 


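Suppose $\phi(X_1, X_2, ..., X_k)$, $k \le n$, is a symmetric kernel for $\theta_F$ and $\mathcal{F}$ is the class of distributions with finite $Var(\phi(X_1, X_2, ..., X_k))$. Define

$\phi_m(x_1, x_2, .., x_m) = E\{\phi(X_1, X_2, .., X_k) | X_1 = x_1, .., X_m = x_m\}$

$= E\phi(x_1, .., x_m, X_{m+1}, .., X_k)$, $1 \le m \le k$, and

$\sigma_m^2 = Var[\phi_m(X_1, X_2, .., X_m)]$, $1 \le m \le k$.


Then $Var(U_n)$ can be expressed as $\sum_{m=1}^{k} p_n(m)\sigma_m^2$, where $p_n(m) = \binom{k}{m}\binom{n-k}{k-m} / \binom{n}{k}$, $1 \le m \le k$.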
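The variance formula above can be evaluated exactly with rational arithmetic. The following is a minimal sketch; the function names `p_n` and `var_u` and the list layout `sigma_sq[m-1]` for $\sigma_m^2$ are assumptions made for this illustration only.

```python
from fractions import Fraction
from math import comb


def p_n(n, k, m):
    """Hypergeometric weight: chance that two k-subsets of {1..n} share exactly m indices."""
    return Fraction(comb(k, m) * comb(n - k, k - m), comb(n, k))


def var_u(n, sigma_sq):
    """Exact Var(U_n); sigma_sq[m - 1] holds sigma_m^2 for m = 1, ..., k."""
    k = len(sigma_sq)
    return sum(p_n(n, k, m) * sigma_sq[m - 1] for m in range(1, k + 1))


# For k = 1 the formula collapses to sigma_1^2 / n.
print(var_u(10, [Fraction(4)]))                 # 2/5
# The weights for m = 0, ..., k add up to one.
print(sum(p_n(10, 3, m) for m in range(0, 4)))  # 1
```

Note that the weights for m = 1, ..., k alone add up to $1 - p_n(0)$, not to one, since pairs of disjoint subsets contribute zero covariance.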


A related observation 


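With the already introduced notations, the smallest and the largest members of the set $\{\sigma_1^2, \sigma_2^2, .., \sigma_k^2\}$ are, respectively, $\sigma_1^2$ and $\sigma_k^2$.


Applying Jensen's inequality to the conditional expectation given $X_1, .., X_m$, we get

$E\phi_{m+1}^2(X_1, X_2, .., X_{m+1}) = E\,E\{\phi_{m+1}^2(X_1, X_2, .., X_{m+1}) | X_1, .., X_m\}$

$\ge E[E\{\phi_{m+1}(X_1, X_2, .., X_{m+1}) | X_1, .., X_m\}]^2$.


Since $E[\phi_{m+1}(X_1, X_2, .., X_{m+1}) | X_1, .., X_m] = \phi_m(X_1, X_2, .., X_m)$, we have $E\phi_{m+1}^2(X_1, X_2, .., X_{m+1}) \ge E\phi_m^2(X_1, X_2, .., X_m)$.


As $\sigma_m^2 = E\phi_m^2(X_1, X_2, .., X_m) - \theta_F^2$, we get $\sigma_{m+1}^2 \ge \sigma_m^2$ for any m.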


6 Derivation of Var(U,,) 


First of all note that $U_n$ is a sum of dependent variables. Therefore the covariance terms are not always zero. Thus

$Var(U_n) = \binom{n}{k}^{-2} \sum_{1 \le i_1 < .. < i_k \le n} \; \sum_{1 \le j_1 < .. < j_k \le n} Cov\{\phi(X_{i_1}, .., X_{i_k}), \phi(X_{j_1}, .., X_{j_k})\}$


We start with proving a related lemma. 


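Lemma: If m is the number of members common to the sets $\{i_1, i_2, .., i_k\}$ and $\{j_1, j_2, .., j_k\}$, we have $Cov\{\phi(X_{i_1}, .., X_{i_k}), \phi(X_{j_1}, .., X_{j_k})\} = \sigma_m^2$, where $\sigma_0^2 = 0$.

Proof: Since $X_i, i = 1, 2, .., n$ are iid, the two kernels are independent when m = 0, so the covariance is 0 for m = 0 and may be non-zero only for m > 0. For $m \ne 0$,

$E\{\phi(X_{i_1}, .., X_{i_k})\phi(X_{j_1}, .., X_{j_k})\} = E\{\phi(X_1, .., X_m, X_{m+1}, .., X_k)\,\phi(X_1, .., X_m, X_{k+1}, .., X_{2k-m})\}$.

Now, given $X_1, .., X_m$, the above two kernels are independent. Then by definition

$E\{\phi(X_1, .., X_m, X_{m+1}, .., X_k) | X_1, .., X_m\} = \phi_m(X_1, .., X_m)$,

$E\{\phi(X_1, .., X_m, X_{k+1}, .., X_{2k-m}) | X_1, .., X_m\} = \phi_m(X_1, .., X_m)$.

Thus

$E\{\phi(X_{i_1}, .., X_{i_k})\phi(X_{j_1}, .., X_{j_k})\} = E\phi_m^2(X_1, .., X_m)$.

Hence,

$Cov\{\phi(X_{i_1}, .., X_{i_k}), \phi(X_{j_1}, .., X_{j_k})\}$

$= E\{\phi(X_{i_1}, .., X_{i_k})\phi(X_{j_1}, .., X_{j_k})\} - \theta_F^2$

$= E\phi_m^2(X_1, .., X_m) - \theta_F^2$

$= \sigma_m^2$.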


And the lemma follows. Next we continue to the result for variance. 


In view of the lemma, the contribution of each covariance term is either 0 or $\sigma_m^2$. For a specific m, $1 \le m \le k$, the coefficient of $\sigma_m^2$ is the number of choices of the sets $\{i_1, i_2, .., i_k\}$ and $\{j_1, j_2, .., j_k\}$ having m elements in common. The number of such choices is $\binom{n}{k}\binom{k}{m}\binom{n-k}{k-m}$.

This is because there are $\binom{n}{k}$ ways of choosing a particular $\{i_1, i_2, .., i_k\}$ and then $\binom{k}{m}$ ways of choosing the m common elements from them. Again, there are $\binom{n-k}{k-m}$ ways of selecting the remaining k − m elements of $\{j_1, j_2, .., j_k\}$ from the remaining n − k numbers. Finally, for a particular m, there are $\binom{n}{k}\binom{k}{m}\binom{n-k}{k-m}$ terms with covariance $\sigma_m^2$. Then

$Var(U_n) = \binom{n}{k}^{-2} \sum_{m=1}^{k} \binom{n}{k}\binom{k}{m}\binom{n-k}{k-m}\sigma_m^2$.


A simplification gives the desired expression of variance.


7 An alternative look at the variance 


Note that $p_n(m)$, $0 \le m \le k$, is the pmf of a Hypergeometric distribution. Define M, a random variable with $P(M = m) = p_n(m)$, and recall that $\sigma_0^2 = 0$. Then, we have the following representation of the U statistic $U_n$:

$E(U_n|M) = \theta_F$ and $Var(U_n|M) = \sigma_M^2$.

It follows from such a representation that $E(U_n) = E\,E(U_n|M) = \theta_F$ and

$Var(U_n) = E\{Var(U_n|M)\} + Var\{E(U_n|M)\}$

$= E(\sigma_M^2)$

$= \sum_{m=1}^{k} \sigma_m^2 P(M = m)$

$= \sum_{m=1}^{k} \sigma_m^2 p_n(m)$


8 Exact bounds for Var(U,,) 


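We have already obtained that
$\sigma_1^2 \le \sigma_2^2 \le ... \le \sigma_k^2$ and $Var(U_n) = \sum_{m=1}^{k} \sigma_m^2 p_n(m)$.


For fixed n, k, $p_n(m)$, $0 \le m \le k$, is the pmf of a Hypergeometric distribution, so the weights $p_n(m)$, $1 \le m \le k$, are non-negative and add up to $1 - p_n(0)$. Then it follows easily that

$\{1 - p_n(0)\}\sigma_1^2 \le Var(U_n) \le \{1 - p_n(0)\}\sigma_k^2 \le \sigma_k^2$.

The factor $1 - p_n(0)$ cannot be dropped from the lower bound: for k = 1, $Var(U_n) = \sigma_1^2/n$, which is smaller than $\sigma_1^2$ whenever n > 1.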


9 Finding Var(U,) 


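Assume that $X_i, i = 1, 2, .., n$ are iid observations from F.

Example 1: $\theta_F = E_F(X_1)$, $\phi(X_1) = X_1$, k = 1, $\sigma_1^2 = Var\,\phi(X_1) = Var(X_1)$, $Var(U_n) = Var(X_1)/n$.

Example 2: $\theta_F = E_F(X_1^r)$, $\phi(X_1) = X_1^r$, k = 1, where r is a known positive integer. Assume that $E(X^s) = \mu_s < \infty$ for s = 2r. $\sigma_1^2 = Var\,\phi(X_1) = Var(X_1^r) = \mu_{2r} - \mu_r^2$. Thus $Var(U_n) = (\mu_{2r} - \mu_r^2)/n$.

Example 3: $\theta_F = P_F(X_1 \le x_0)$, $\phi(X_1) = I(X_1 \le x_0)$, k = 1.
Assume that $x_0$ is known, then $\sigma_1^2 = Var\,\phi(X_1) = Var\,I(X_1 \le x_0) = \theta_F(1 - \theta_F)$. Thus $Var(U_n) = \theta_F(1 - \theta_F)/n$.

Example 4: $\theta_F = \{E_F(X_1)\}^2$, $\phi(X_1, X_2) = X_1X_2$, k = 2. Assume $E(X_1) = \mu$ and $Var(X_1) = \sigma^2 < \infty$.
$\phi_1(X_1) = E\{\phi(X_1, X_2) | X_1\} = \mu X_1$
$\sigma_1^2 = Var\,\phi_1(X_1) = \mu^2\sigma^2$
$\phi_2(X_1, X_2) = \phi(X_1, X_2) = X_1X_2$
$\sigma_2^2 = Var\,\phi_2(X_1, X_2) = Var(X_1X_2) = \sigma^4 + 2\mu^2\sigma^2$
$Var(U_n) = \frac{4(n-2)}{n(n-1)}\mu^2\sigma^2 + \frac{2}{n(n-1)}(\sigma^4 + 2\mu^2\sigma^2)$

Example 5: $\theta_F = Var_F(X_1)$, $\phi(X_1, X_2) = \frac{1}{2}(X_1 - X_2)^2$, k = 2.
Assume $E(X_1) = \mu$ and $Var(X_1) = \sigma^2$, $E(X_1 - \mu)^4 = \mu_4 < \infty$.
$\phi_1(X_1) = E\{\phi(X_1, X_2) | X_1\} = \sigma^2/2 + (X_1 - \mu)^2/2$
$\sigma_1^2 = Var\,\phi_1(X_1) = Var((X_1 - \mu)^2/2) = (\mu_4 - \sigma^4)/4$
$\phi_2(X_1, X_2) = \phi(X_1, X_2) = \frac{1}{2}(X_1 - X_2)^2$
$\sigma_2^2 = Var\,\phi_2(X_1, X_2) = (\mu_4 + \sigma^4)/2$

$Var(U_n) = \frac{4(n-2)}{n(n-1)} \cdot \frac{\mu_4 - \sigma^4}{4} + \frac{2}{n(n-1)} \cdot \frac{\mu_4 + \sigma^4}{2} = \left\{\mu_4 - \frac{(n-3)\sigma^4}{n-1}\right\}/n$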
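As a quick check of the degree-2 variance formula for the sample variance, the sketch below evaluates it exactly. It is illustrative only: the function names are hypothetical, and the choice $\mu_4 = 3\sigma^4$ corresponds to a normal population, for which the well-known value $Var(s^2) = 2\sigma^4/(n-1)$ should be recovered.

```python
from fractions import Fraction
from math import comb


def var_u_degree2(n, sigma1_sq, sigma2_sq):
    """Var(U_n) for a kernel of degree 2: p_n(1) * sigma_1^2 + p_n(2) * sigma_2^2."""
    pairs = comb(n, 2)
    return Fraction(2 * (n - 2), pairs) * sigma1_sq + Fraction(1, pairs) * sigma2_sq


def var_sample_variance(n, mu4, sigma_sq):
    """Variance of the U statistic with kernel (x1 - x2)^2 / 2."""
    sigma1_sq = (mu4 - sigma_sq ** 2) / 4
    sigma2_sq = (mu4 + sigma_sq ** 2) / 2
    return var_u_degree2(n, sigma1_sq, sigma2_sq)


def closed_form(n, mu4, sigma_sq):
    """The simplified expression {mu4 - (n - 3) sigma^4 / (n - 1)} / n."""
    return (mu4 - Fraction(n - 3, n - 1) * sigma_sq ** 2) / n


n = 8
sigma_sq = Fraction(3)          # hypothetical population variance
mu4 = 3 * sigma_sq ** 2         # fourth central moment of a normal population
print(var_sample_variance(n, mu4, sigma_sq), closed_form(n, mu4, sigma_sq))
```

Both expressions give 18/7 here, which equals $2\sigma^4/(n-1)$ with $\sigma^2 = 3$ and n = 8.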


Nonparametric Inference: Module 5: 


What we provide in this module 


• Consistency of U statistic
• Asymptotic normality of U statistic
• Degeneracy of U statistic
• Few examples




1 Large Sample Properties 


Unbiasedness and minimum variance are small sample properties of an estimator. To judge the performance of an estimator for large sample sizes, one uses consistency and asymptotic normality. Consistency gives the limiting value whereas asymptotic normality specifies the rate of convergence to the limiting value. A consistent estimator having asymptotic normality is known as Consistent Asymptotic Normal (CAN). A U statistic is a CAN estimator.


Result: Suppose $U_n$ is a U statistic with a kernel of degree k, $\sigma_1^2 > 0$ and $\sigma_i^2 < \infty$, $i \ge 2$. Then $\lim_{n \to \infty} Var(\sqrt{n}U_n) = k^2\sigma_1^2$.

Proof: For any a,

$n\,p_n(a) = n\binom{k}{a}\binom{n-k}{k-a}/\binom{n}{k} = \binom{k}{a}\frac{k!}{(k-a)!} \cdot \frac{n\prod_{j=0}^{k-a-1}(n-k-j)}{\prod_{j=0}^{k-1}(n-j)}$.

The numerator is the product of k − a + 1 terms while the denominator involves k terms only. Each term is linear in n.

Thus for a > 1, the RHS goes to zero whereas for a = 1, it goes to $k^2$. Thus $Var(\sqrt{n}U_n) = k^2\sigma_1^2 + O(n^{-1})$. Hence, as $n \to \infty$, $Var(\sqrt{n}U_n) \to k^2\sigma_1^2$.
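The limiting behaviour of the weights can be seen numerically. The sketch below (the helper name `n_times_weight` is an assumption for this illustration) evaluates n times the hypergeometric weight $\binom{k}{a}\binom{n-k}{k-a}/\binom{n}{k}$ for k = 3 and growing n: the a = 1 term approaches $k^2 = 9$ and the terms with a > 1 vanish.

```python
from math import comb


def n_times_weight(n, k, a):
    """n * C(k, a) * C(n - k, k - a) / C(n, k), the scaled weight of sigma_a^2."""
    return n * comb(k, a) * comb(n - k, k - a) / comb(n, k)


k = 3
for n in (10, 100, 1000, 10**6):
    # Columns: n, scaled weight for a = 1, a = 2, a = 3.
    print(n, [round(n_times_weight(n, k, a), 6) for a in (1, 2, 3)])
```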


2 Consistency of U statistic 


Now we come to the proof of consistency of the U statistic. First of all, by definition, $E(U_n) = \theta_F$. Consider the representation $Var(U_n) = Var(\sqrt{n}U_n) \cdot \frac{1}{n}$. Then, as $n \to \infty$, the first factor approaches $k^2\sigma_1^2$ by the previous result. The second factor converges to zero. Thus $Var(U_n) \to 0$ as $n \to \infty$. Since $U_n$ is unbiased, consistency follows from Chebyshev's inequality.


3 Asymptotic Distribution of U statistic


Result: Provided $\sigma_1^2 > 0$, as $n \to \infty$,

$\sqrt{n}(U_n - \theta_F) \to N(0, k^2\sigma_1^2)$

in distribution.


Proof: We start with the fact that for $\sigma_1^2 = 0$ the limit $N(0, k^2\sigma_1^2)$ is degenerate at zero, and hence assume $\sigma_1^2 > 0$ for further development.
First of all, we note that $U_n$ is not a sum of iid random variables, so the large sample distribution cannot be obtained by a straightforward application of the Central Limit Theorem.


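Consider the representation,

$\sqrt{n}(U_n - \theta_F) = \frac{k}{\sqrt{n}}\sum_{i=1}^{n}\{\phi_1(X_i) - \theta_F\} + \epsilon_n$

where

$\epsilon_n = \sqrt{n}(U_n - \theta_F) - \frac{k}{\sqrt{n}}\sum_{i=1}^{n}\{\phi_1(X_i) - \theta_F\}$.

Now $\phi_1(X_i)$ are iid random variables, and hence by the Lindeberg-Levy CLT

$\frac{k}{\sqrt{n}}\sum_{i=1}^{n}\{\phi_1(X_i) - \theta_F\} \xrightarrow{D} N(0, k^2\sigma_1^2)$

as $n \to \infty$. Then the proof will follow if we can establish that $\epsilon_n \xrightarrow{P} 0$ as $n \to \infty$.

Since $E(\epsilon_n) = 0$, we shall show only that $E(\epsilon_n^2) \to 0$. Now

$E(\epsilon_n^2) = Var\{\sqrt{n}(U_n - \theta_F)\} + Var\left[\frac{k}{\sqrt{n}}\sum_{i=1}^{n}\{\phi_1(X_i) - \theta_F\}\right] - 2\,Cov\left(\sqrt{n}(U_n - \theta_F), \frac{k}{\sqrt{n}}\sum_{i=1}^{n}\{\phi_1(X_i) - \theta_F\}\right)$ ... (*)

The first term in the RHS of (*) converges to $k^2\sigma_1^2$. The next term in the RHS of (*) equals $k^2\sigma_1^2$. Thus the proof will be complete provided the covariance in the last term of the RHS also converges to $k^2\sigma_1^2$.

Now

$Cov\left(\sqrt{n}(U_n - \theta_F), \frac{k}{\sqrt{n}}\sum_{i=1}^{n}\{\phi_1(X_i) - \theta_F\}\right) = k\binom{n}{k}^{-1}\sum_{1 \le i_1 < .. < i_k \le n}\;\sum_{i=1}^{n} Cov\{\phi(X_{i_1}, .., X_{i_k}), \phi_1(X_i)\}$.

The inside covariance is zero unless $i \in \{i_1, .., i_k\}$. Since the observations are iid, in the remaining cases it is nothing but $Cov(\phi(X_1, .., X_k), \phi_1(X_1))$. A simple use of conditioning expresses this covariance as $\sigma_1^2$.

Now there are $\binom{k}{1}\binom{n}{k}$ covariance terms with value $\sigma_1^2$. Then we get the desired expression (i.e. $k^2\sigma_1^2$) for the covariance in the last term in the RHS of (*). As a result $\sqrt{n}(U_n - \theta_F)$ and $\frac{k}{\sqrt{n}}\sum_{i=1}^{n}\{\phi_1(X_i) - \theta_F\}$ have the same asymptotic distribution. Since the latter has the desired asymptotic distribution, the proof concludes.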


4 Few Examples-k = 1 


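Assume that $X_i, i = 1, 2, .., n$ are iid observations from F.

Example 1: $\theta_F = E_F(X_1)$, $\phi(X_1) = X_1$, $U_n = \bar{X}_n$

$\sigma_1^2 = Var\,\phi(X_1) = Var(X_1)$

$\sqrt{n}(U_n - \theta_F) \xrightarrow{D} N(0, Var(X_1))$

Example 2: $\theta_F = E_F(X_1^r)$, $\phi(X_1) = X_1^r$, $U_n = \frac{1}{n}\sum_{i=1}^{n} X_i^r$

$\sigma_1^2 = Var\,\phi(X_1) = Var(X_1^r) = \mu_{2r} - \mu_r^2$

$\sqrt{n}(U_n - \theta_F) \xrightarrow{D} N(0, \mu_{2r} - \mu_r^2)$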


5 Few Examples-k = 2 


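Example 3: $\theta_F = \{E_F(X_1)\}^2$, $\phi(X_1, X_2) = X_1X_2$, $U_n = \bar{X}^2 - s^2/n$
Assume $E(X_1) = \mu$ and $Var(X_1) = \sigma^2 < \infty$

$\sigma_1^2 = Var\,\phi_1(X_1) = \mu^2\sigma^2$

$\sqrt{n}(U_n - \theta_F) \xrightarrow{D} N(0, 4\mu^2\sigma^2)$


Example 4: $\theta_F = Var(X_1)$, $\phi(X_1, X_2) = \frac{1}{2}(X_1 - X_2)^2$, $U_n = s^2$
Assume $E(X_1) = \mu$ and $Var(X_1) = \sigma^2$, $E(X_1 - \mu)^4 = \mu_4 < \infty$.
$\sigma_1^2 = Var\,\phi_1(X_1) = Var\{(X_1 - \mu)^2/2\} = (\mu_4 - \sigma^4)/4$
$\sqrt{n}(U_n - \theta_F) \xrightarrow{D} N(0, \mu_4 - \sigma^4)$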


6 Degeneracy of U statistic 


It may happen that the asymptotic variance of a U statistic is zero. For example, suppose $\theta_F = \{E_F(X_1)\}^2$. Assume $E(X_1) = \mu = 0$ and $Var(X_1) = \sigma^2 < \infty$. Then $\sigma_1^2 = Var\,\phi_1(X_1) = \mu^2\sigma^2 = 0$. Naturally, $\sqrt{n}(U_n - \theta_F)$ converges to a degenerate distribution. Thus the asymptotic normality does not hold.


Definition: Consider a U statistic with a symmetric kernel of degree k. U is said to have a degeneracy of order m if

$\sigma_1^2 = \sigma_2^2 = .. = \sigma_m^2 = 0$,

but $\sigma_{m+1}^2 > 0$. Alternatively, since for a U statistic with a kernel of degree k,

$\sigma_1^2 \le \sigma_2^2 \le ... \le \sigma_k^2$,

the definition can be restated as:

U is said to have a degeneracy of order m if $\sigma_m^2 = 0$ but $\sigma_{m+1}^2 > 0$.


An Example 


Consider the example already stated. Naturally, for $\mu = 0$, $\sigma_1^2 = 0$ but $\sigma_2^2 = \sigma^4 > 0$. Then U has degeneracy of order 1. In the presence of degeneracy, we need to use a different normalizing constant to reach a non-degenerate limiting distribution. To obtain the limiting distribution under degeneracy, we start from the identity $U_n = \bar{X}^2 - s^2/n$, that is,

$nU_n = (\sqrt{n}\bar{X})^2 - s^2$ ... (*)

The first term in the RHS of (*) converges in distribution to a $\sigma^2\chi_1^2$ distribution. The second term converges to $\sigma^2$ in probability. Now an application of Slutsky's Theorem gives that the LHS of (*) converges to a $\sigma^2(\chi_1^2 - 1)$ distribution. Thus we need n as the normalizing factor to get a non-degenerate limiting distribution. However, the limiting distribution under degeneracy is no longer normal.
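The algebraic identity behind the degenerate limit, $nU_n = n\bar{X}^2 - s^2$ for the product kernel $\phi(x_1, x_2) = x_1x_2$, can be verified directly. The following sketch uses hypothetical data and a hypothetical helper name; it only checks the identity, not the limiting distribution.

```python
from itertools import combinations
from statistics import mean, variance


def u_product_kernel(sample):
    """U statistic with kernel x1 * x2, averaged over all unordered pairs."""
    pairs = list(combinations(sample, 2))
    return sum(x * y for x, y in pairs) / len(pairs)


# Hypothetical observations, used only to illustrate the identity.
data = [-1.5, 0.3, 2.2, -0.7, 0.9, -1.1]
n = len(data)
lhs = n * u_product_kernel(data)
rhs = n * mean(data) ** 2 - variance(data)   # variance() uses divisor n - 1
print(lhs, rhs)
```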


Nonparametric Inference: Module 6: 


What we provide in this module 


• Functional in two sample problems
• Estimability and kernels
• Symmetry of kernels
• Two sample U statistic: Definition & properties
• Examples




1 Functional 


Suppose $X_1, X_2, ..., X_n$ are iid observations from F and $Y_1, Y_2, ..., Y_m$ are iid observations from G. The two samples are independent. F and G are unknown but known to be continuous. Suppose $\mathcal{F}$ is the class of all absolutely continuous DFs. Then $\theta(F, G)$ is a functional defined on $\mathcal{F} \times \mathcal{F}$.


2  Estimability & Kernels 


$\theta(F, G)$ is said to be estimable if there exists a $\phi(X_1, X_2, ..., X_a, Y_1, Y_2, .., Y_b)$, $a \le n$, $b \le m$, such that

$E\{\phi(X_1, X_2, .., X_a, Y_1, Y_2, .., Y_b)\} = \theta(F, G)$ for all $(F, G) \in \mathcal{F} \times \mathcal{F}$.


$\phi$ is called a kernel for estimation of $\theta(F, G)$. The minimum value of (a, b) with $a \le n$, $b \le m$ ensuring the above is called the degree of the kernel.


3 Symmetric Kernels 


A kernel $\phi(X_{i_1}, X_{i_2}, ..., X_{i_a}, Y_{j_1}, Y_{j_2}, .., Y_{j_b})$ is said to be symmetric if it remains unchanged for separate permutations of $\{i_1, i_2, ..., i_a\}$ and $\{j_1, j_2, ..., j_b\}$. As in the single sample case, corresponding to any asymmetric kernel there is always a symmetrized version. We are not going to give the rigorous proof, and instead try to build understanding by means of a few examples.


3.1 Examples 


Assume two independent samples $X_i \sim F, i = 1, 2, .., n$ and $Y_j \sim G, j = 1, 2, .., m$.

Example 1: Consider $\theta(F, G) = P(X_1 < Y_1)$. Since $\theta(F, G) = E\,I(X_1 < Y_1)$, we consider the kernel $\phi(X_1, Y_1) = I(X_1 < Y_1)$, where I( ) is the indicator function. Clearly $\phi(X_1, Y_1)$ is a kernel of degree (1,1). The symmetry is immediate.


Example 2: Consider $\theta(F, G) = P(X_1 < Y_1, X_2 < Y_1)$. Since $\theta(F, G) = E\,I(X_1 < Y_1, X_2 < Y_1)$, we consider the kernel $\phi(X_1, X_2, Y_1) = I(X_1 < Y_1, X_2 < Y_1)$, where I( ) is the indicator function. Thus $\phi$ is a kernel of degree (2,1). Naturally $\phi$ does not change if $X_1$ and $X_2$ are interchanged. Hence $\phi$ is symmetric.


Example 3: Consider $\theta(F, G) = P(X_1 < Y_1 < X_2)$. Since $\theta(F, G) = E\,I(X_1 < Y_1 < X_2)$, we consider the kernel $\phi(X_1, X_2, Y_1) = I(X_1 < Y_1 < X_2)$, where I( ) is the indicator function. $\phi$ is a kernel of degree (2,1). But $\phi$ changes if $X_1$ and $X_2$ are interchanged. Thus $\phi$ is asymmetric. The corresponding symmetric kernel is $\frac{1}{2}\{I(X_1 < Y_1 < X_2) + I(X_2 < Y_1 < X_1)\}$.


Example 4: For $\theta(F, G) = V_F(X_1) - V_G(Y_1)$, we consider the kernel $\phi(X_1, X_2, Y_1, Y_2) = \{X_1^2 - X_1X_2\} - \{Y_1^2 - Y_1Y_2\}$. $\phi$ is a kernel of degree (2,2). But $\phi$ changes if $X_1$ and $X_2$ and/or $Y_1$ and $Y_2$ are interchanged. Thus $\phi$ is asymmetric and the corresponding symmetrized version can be obtained as $\frac{1}{2}\{(X_1 - X_2)^2 - (Y_1 - Y_2)^2\}$.
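Symmetrization amounts to averaging the kernel over separate permutations of its X arguments and of its Y arguments. A minimal sketch follows; the helper `symmetrize` and the kernel names are hypothetical, introduced only to illustrate Examples 3 and 4.

```python
from itertools import permutations


def symmetrize(kernel, a, b):
    """Average a kernel over separate permutations of its a X-arguments and b Y-arguments."""
    def symmetric_kernel(*args):
        xs, ys = args[:a], args[a:a + b]
        values = [kernel(*px, *py)
                  for px in permutations(xs)
                  for py in permutations(ys)]
        return sum(values) / len(values)
    return symmetric_kernel


def between(x1, x2, y1):
    """Asymmetric kernel I(x1 < y1 < x2) of degree (2, 1)."""
    return float(x1 < y1 < x2)


def var_diff(x1, x2, y1, y2):
    """Asymmetric kernel {x1^2 - x1 x2} - {y1^2 - y1 y2} of degree (2, 2)."""
    return (x1 ** 2 - x1 * x2) - (y1 ** 2 - y1 * y2)


sym_between = symmetrize(between, 2, 1)
sym_var_diff = symmetrize(var_diff, 2, 2)
print(sym_between(1, 3, 2), sym_between(3, 1, 2))   # same value in either order
print(sym_var_diff(1, 3, 2, 5))                     # ((1-3)^2 - (2-5)^2) / 2
```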


4 U statistic-Definition 


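Define $C_1 = \{(i_1, i_2, .., i_a) : 1 \le i_1 < i_2 < .. < i_a \le n\}$ and $C_2 = \{(j_1, j_2, .., j_b) : 1 \le j_1 < j_2 < .. < j_b \le m\}$. Then we have in total $\binom{n}{a}\binom{m}{b}$ possible kernels of degree (a,b). If $\phi$ is a symmetric kernel of degree (a,b), then the corresponding U statistic is defined as

$U = \left\{\binom{n}{a}\binom{m}{b}\right\}^{-1} \sum_{C_1}\sum_{C_2} \phi(X_{i_1}, .., X_{i_a}, Y_{j_1}, .., Y_{j_b})$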


4.1 U statistic-Properties 


As earlier $E(U) = \theta(F, G)$, that is, U is unbiased. Again U is a symmetric function of the observations. Naturally, U can be expressed in terms of the full set of order statistics. Since the full set of order statistics is complete sufficient under fairly general assumptions, U is an unbiased estimator based on the complete sufficient statistics. Thus U is MVUE of $\theta(F, G)$ even in two samples.

For the calculation of exact variance, define $\phi_{cd}(x_1, .., x_c, y_1, .., y_d) = E\{\phi(X_1, X_2, ..., X_a; Y_1, Y_2, .., Y_b) | X_i = x_i, i = 1, 2, .., c; Y_j = y_j, j = 1, 2, .., d\}$.

$\sigma_{cd}^2$ = covariance between $\phi(X_1, .., X_c, X_{c+1}, .., X_a; Y_1, .., Y_d, Y_{d+1}, .., Y_b)$ and $\phi(X_1, .., X_c, X'_{c+1}, .., X'_a; Y_1, .., Y_d, Y'_{d+1}, .., Y'_b)$ for $c + d \ge 1$, where $X_i, X'_i$ are iid as F and $Y_j, Y'_j$ are iid as G. Then the exact variance of U can be expressed as

$Var(U) = \sum_{c=0}^{a}\sum_{d=0}^{b} \frac{\binom{a}{c}\binom{n-a}{a-c}}{\binom{n}{a}} \cdot \frac{\binom{b}{d}\binom{m-b}{b-d}}{\binom{m}{b}}\,\sigma_{cd}^2$, with $\sigma_{00}^2 = 0$.

4.2 Examples 


Assume $X_i$ are iid as F and $Y_j$ are iid as G.

Example 1: $\theta(F, G) = E(X_1) - E(Y_1)$. Take $\phi(X_1, Y_1) = X_1 - Y_1$, then $\phi$ is symmetric with degree (1,1). Then the corresponding U statistic is:

$U = \frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}(X_i - Y_j) = \bar{X} - \bar{Y}$.

Thus we get the usual unbiased estimator. Since $\phi$ has degree (1,1), we have the choices (0,1), (1,0) and (1,1) for (c, d). Thus for the calculation of variance, we need to calculate $\sigma_{01}^2$, $\sigma_{10}^2$ and $\sigma_{11}^2$. Now

$\sigma_{01}^2 = cov(\phi(X_1, Y_1), \phi(X_2, Y_1)) = Var_G(Y_1)$,
$\sigma_{10}^2 = cov(\phi(X_1, Y_1), \phi(X_1, Y_2)) = Var_F(X_1)$,
$\sigma_{11}^2 = cov(\phi(X_1, Y_1), \phi(X_1, Y_1)) = Var_F(X_1) + Var_G(Y_1)$.

Then, a simple manipulation expresses the variance as

$Var(U) = \frac{1}{nm}\{(m-1)V_F(X_1) + (n-1)V_G(Y_1) + V_F(X_1) + V_G(Y_1)\} = \frac{V_F(X_1)}{n} + \frac{V_G(Y_1)}{m}$
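The variance of a two sample U statistic can be assembled from the $\sigma_{cd}^2$ by weighting each with a product of two hypergeometric probabilities. The sketch below assumes the standard form $Var(U) = \sum_{c}\sum_{d} \frac{\binom{a}{c}\binom{n-a}{a-c}}{\binom{n}{a}} \frac{\binom{b}{d}\binom{m-b}{b-d}}{\binom{m}{b}} \sigma_{cd}^2$ with $\sigma_{00}^2 = 0$; the function name and the dictionary layout are hypothetical. For the kernel $X_1 - Y_1$ it reproduces $V_F(X_1)/n + V_G(Y_1)/m$.

```python
from fractions import Fraction
from math import comb


def two_sample_var(n, m, a, b, sigma_sq):
    """Exact Var(U) for a kernel of degree (a, b); sigma_sq[(c, d)] holds sigma_cd^2."""
    total = Fraction(0)
    for c in range(a + 1):
        for d in range(b + 1):
            if c == 0 and d == 0:
                continue  # disjoint index sets contribute zero covariance
            weight_x = Fraction(comb(a, c) * comb(n - a, a - c), comb(n, a))
            weight_y = Fraction(comb(b, d) * comb(m - b, b - d), comb(m, b))
            total += weight_x * weight_y * sigma_sq[(c, d)]
    return total


# Hypothetical population variances for the kernel X1 - Y1 of degree (1, 1).
var_x, var_y = Fraction(4), Fraction(9)
sigma_sq = {(1, 0): var_x, (0, 1): var_y, (1, 1): var_x + var_y}
print(two_sample_var(5, 7, 1, 1, sigma_sq), var_x / 5 + var_y / 7)
```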


Example 2: Suppose $\theta = \theta(F, G) = P(X_1 < Y_1)$. We take $\phi(X_1, Y_1) = I(X_1 < Y_1)$, a symmetric kernel with degree (1,1). The corresponding U statistic is:

$U = \frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m} I(X_i < Y_j)$

U is the famous Mann-Whitney statistic for comparing distributions. As earlier, we have the choices (0,1), (1,0) and (1,1) for (c, d). Thus for the calculation of variance, we need to calculate only $\sigma_{01}^2$, $\sigma_{10}^2$ and $\sigma_{11}^2$. Since $E\phi(X_i, Y_j) = \theta(F, G)$, $\sigma_{10}^2 = P(X_1 < Y_1, X_1 < Y_2) - \theta^2$. A simple application of conditioning gives $\sigma_{10}^2 = \int (1 - G(x))^2 dF(x) - \theta^2$. Similar applications give:

$\sigma_{01}^2 = P(X_1 < Y_1, X_2 < Y_1) - \theta^2 = \int F^2(y)dG(y) - \theta^2$

$\sigma_{11}^2 = Var(I(X_1 < Y_1)) = \theta(1 - \theta)$


Using these, the final expression of the variance can be obtained.
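For kernels of degree (1,1) the two sample U statistic is simply the average of the kernel over all (X, Y) pairs. The sketch below is illustrative only (the function name and both samples are hypothetical); it computes the Mann-Whitney form with the indicator kernel and, for comparison, the difference of means with the kernel $x - y$.

```python
from itertools import product


def two_sample_u(xs, ys, kernel):
    """Two sample U statistic for a kernel of degree (1, 1)."""
    return sum(kernel(x, y) for x, y in product(xs, ys)) / (len(xs) * len(ys))


# Hypothetical samples, for illustration only.
xs = [1.2, 3.4, 2.2]
ys = [2.5, 4.1, 0.7, 3.9]

mann_whitney = two_sample_u(xs, ys, lambda x, y: float(x < y))  # estimates P(X < Y)
mean_diff = two_sample_u(xs, ys, lambda x, y: x - y)            # equals mean(xs) - mean(ys)
print(mann_whitney, mean_diff)
```

Here 8 of the 12 pairs satisfy x < y, so the Mann-Whitney form equals 8/12.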


Example 3: For $\theta(F, G) = Var(X_1) - Var(Y_1)$, take $\phi(X_1, X_2, Y_1, Y_2) = \frac{1}{2}\{(X_1 - X_2)^2 - (Y_1 - Y_2)^2\}$, a symmetric kernel with degree (2,2). The corresponding U statistic simplifies to:

$U = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2 - \frac{1}{m-1}\sum_{j=1}^{m}(Y_j - \bar{Y})^2 = s_X^2 - s_Y^2$


Since $\phi$ has degree (2,2), we have a total of $3^2 - 1$ possible choices, namely (0,1), (1,0), (1,1), (0,2), (2,0), (1,2), (2,1) and (2,2) for (c,d). Here $\sigma_{cd}^2$ is the covariance between two copies of $\phi(X_1, X_2, Y_1, Y_2)$ whose arguments share exactly c of the X's and d of the Y's. A simple calculation gives

$\sigma_{01}^2 = Var\{(Y_1 - \mu_Y)^2\}/4$, and $\sigma_{10}^2 = Var\{(X_1 - \mu_X)^2\}/4$. In a similar fashion the other terms and hence the final expression can be obtained. The derivation is straightforward but time consuming. Now it is easy to observe that U is a difference of two univariate U statistics. Also the two univariate U statistics are independent. That is, $U = U_1 - U_2$, where $U_1$ and $U_2$ are independent univariate U statistics. Then $Var(U) = Var(U_1) + Var(U_2)$.

From the previous deductions, the variances of $U_1$ and $U_2$ and hence the final expression of Var(U) can be obtained.


5 Consistency of U statistic 


Let U be the U statistic corresponding to a symmetric kernel of degree (a,b). Assume n observations are taken from F and m observations are taken from G. Suppose the following conditions hold:

(A1) $\sigma_{cd}^2 > 0$ for (c,d) = (0,1), (1,0), (a,b) and

(A2) Suppose for each N = n + m, there exists n = n(N) and m = m(N) such that as $N \to \infty$,

min(m, n) $\to \infty$ and $\frac{n}{N}\left(\frac{m}{N}\right) \to \lambda\,(1 - \lambda)$ with $\lambda \in (0, 1)$.

Then under (A1) and (A2),

$Var(\sqrt{N}U) \to \sigma^2 = \frac{a^2\sigma_{10}^2}{\lambda} + \frac{b^2\sigma_{01}^2}{1 - \lambda}$.

Now we have $E(U) = \theta(F, G)$ and consider the representation

$Var(U) = \frac{1}{N}Var(\sqrt{N}U)$.

Then considering the previous results, we observe that as min(m, n) $\to \infty$, $Var(\sqrt{N}U) \to \sigma^2$ and $\frac{1}{N} \to 0$. Combining, we get that $Var(U) \to 0$. Therefore, the consistency of the U statistic follows for two samples.


6 Asymptotic Distribution of U statistic 


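Result: Under conditions (A1) and (A2) above, as $N \to \infty$,

$\sqrt{N}(U - \theta(F, G)) \xrightarrow{D} N(0, \sigma^2)$,

where $\sigma^2 = \frac{a^2\sigma_{10}^2}{\lambda} + \frac{b^2\sigma_{01}^2}{1 - \lambda}$.


Proof: Define

$Z_N = \sqrt{N}(U - \theta) - \frac{a\sqrt{N}}{n}\sum_{i=1}^{n}(\phi_{10}(X_i) - \theta) - \frac{b\sqrt{N}}{m}\sum_{j=1}^{m}(\phi_{01}(Y_j) - \theta)$.

Then it can be observed that
(i) $E(Z_N) = 0$
(ii) By the CLT, the second and third terms in the RHS of $Z_N$ are asymptotically independent and normal
(iii) $E(Z_N^2) \to 0$ as $N \to \infty$.

Thus

$Z_N \xrightarrow{P} 0$

as $N \to \infty$ and hence

$\sqrt{N}(U - \theta)$

and

$\frac{a\sqrt{N}}{n}\sum_{i=1}^{n}(\phi_{10}(X_i) - \theta) + \frac{b\sqrt{N}}{m}\sum_{j=1}^{m}(\phi_{01}(Y_j) - \theta)$

have the same asymptotic distribution. The latter is asymptotically $N(0, \sigma^2)$ and hence the proof is complete.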


6.1 Examples 


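Example 1 revisited: For $\theta(F, G) = E(X_1) - E(Y_1)$, applying the above result and the expressions of $\sigma_{10}^2$ and $\sigma_{01}^2$, we get as $N \to \infty$, $\sqrt{N}(\bar{X} - \bar{Y} - \theta) \xrightarrow{D} N(0, \sigma^2)$, where

$\sigma^2 = \frac{\sigma_{10}^2}{\lambda} + \frac{\sigma_{01}^2}{1 - \lambda} = \frac{V_F(X_1)}{\lambda} + \frac{V_G(Y_1)}{1 - \lambda}$.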


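Example 2 revisited: Suppose $\theta = \theta(F, G) = P(X_1 < Y_1)$. Then as $N \to \infty$, $\sqrt{N}(U - \theta) \xrightarrow{D} N(0, \sigma^2)$, where

$\sigma^2 = \frac{\int (1 - G(x))^2 dF(x) - \theta^2}{\lambda} + \frac{\int F^2(y)dG(y) - \theta^2}{1 - \lambda}$.


If F = G, then $\theta = \frac{1}{2}$ and $\sigma_{10}^2 = \sigma_{01}^2 = \frac{1}{12}$. Thus under F = G, we have

$\sqrt{N}\left(U - \frac{1}{2}\right) \xrightarrow{D} N\left(0, \frac{1}{12\lambda(1 - \lambda)}\right)$.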
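The constants in the F = G case follow from the probability integral transform. The short derivation below is a sketch under the assumption that F is continuous, so that $u = F(x)$ is uniform on (0, 1); the symbols are those used in this example.

```latex
\begin{align*}
\theta &= P(X_1 < Y_1) = \int F(y)\,dF(y) = \int_0^1 u\,du = \tfrac{1}{2},\\
\sigma_{10}^2 &= \int \{1 - F(x)\}^2\,dF(x) - \theta^2
             = \int_0^1 (1-u)^2\,du - \tfrac{1}{4}
             = \tfrac{1}{3} - \tfrac{1}{4} = \tfrac{1}{12},\\
\sigma_{01}^2 &= \int F^2(y)\,dF(y) - \theta^2
             = \int_0^1 u^2\,du - \tfrac{1}{4} = \tfrac{1}{12},\\
\sigma^2 &= \frac{1}{12}\left(\frac{1}{\lambda} + \frac{1}{1-\lambda}\right)
          = \frac{1}{12\,\lambda(1-\lambda)}.
\end{align*}
```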


Example 3 revisited: For $\theta(F, G) = Var(X_1) - Var(Y_1)$, we get with the already derived expressions, as $N \to \infty$, $\sqrt{N}(U - \theta) \xrightarrow{D} N(0, \sigma^2)$, where

$\sigma^2 = \frac{Var\{(X_1 - \mu_X)^2\}}{\lambda} + \frac{Var\{(Y_1 - \mu_Y)^2\}}{1 - \lambda}$.


It is easy to observe that the asymptotic variance is the weighted sum of the asymptotic variances of two univariate U statistics.


Nonparametric Inference: Module 7:


What we provide in this module 


• Nonparametric hypothesis testing and confidence interval: Why the need?
• Components of a nonparametric test
• Consistency of tests-Some basics
• Different nonparametric hypothesis testing problems
• A distribution free confidence interval for the population quantile




1 Nonparametric hypothesis testing and confidence interval: Why the need?


The central idea of parametric (or classical) inference is the assumption regarding the underlying population. The entire theory of parametric inference is developed under this assumption and consequently these procedures are valid only as long as the assumption is satisfied. For example, Student's t test is appropriate only when the underlying distribution is normal. But the normal is not the only distribution having applications in real life. For example, in survival trials, the lifetime distributions are mostly exponential, gamma or Weibull, that is, non-normal. Therefore, inference based on the t test in such a situation is often misleading. Parametric tests are thus useful only when the experimenter is sufficiently confident about the underlying distribution. Unfortunately, the available methods of identifying the underlying distribution are limited to standard distributions only.

Therefore, it will be better if hypothesis testing procedures can be developed with minimal assumptions about the underlying distribution (say, continuity of observations). Suppose there exists a statistic T, relevant to the testing problem, such that the exact and/or conditional and/or asymptotic distribution of T under the null hypothesis is independent of the underlying distribution. Naturally, the significance level for the test based on T does not depend on the underlying distribution, that is, these tests are level robust. Tests based on such a T are commonly termed nonparametric or distribution free.

In practice, T is often formed by taking only the signs or ranks of the actual observations. But such a T does not take into account the full information contained in the individual observations, and hence these tests are often less efficient than their parametric counterparts. However, nonparametric tests are the valid choices whenever the validity of the parametric assumptions is questionable.

As in parametric inference problems, the experimenter may be interested in constructing a confidence interval with coverage probability independent of the underlying distribution. However, in nonparametric inference, the unknown quantities of interest are mostly location or scale parameters. As in the parametric counterpart, we can invert the acceptance region of a distribution free test to get a distribution free confidence interval with high coverage probability. These will be discussed later. However, if the unknown quantity of interest is a population quantile, we can develop a distribution free confidence interval based on the sample order statistics.


2 Components of a nonparametric test 


In most of the problems of nonparametric inference, the underlying distribution is not specified except for continuity. Therefore the available methods (e.g. Neyman Pearson lemma, likelihood ratio method) of test construction are of no use and hence, we need to develop tests with intuitive appeal. In particular, if we can identify a distribution free statistic T for the problem, we can develop a distribution free test. However, for a meaningful development, we maintain the following sequence:

1. Model assumption (i.e. the minimal set of assumptions about the underlying distribution)
2. Hypothesis of interest (i.e. specification of the null and alternative hypotheses)
3. Available tests for the problem (i.e. the usual parametric procedures for specific probability models)
4. Suggesting a distribution free statistic (i.e. specifying some T)
5. Justifying the form of critical region.
6. Investigating unbiasedness and consistency of the suggested test and finally
7. Providing a large sample test corresponding to the given test.


3 Consistency of tests-Some basics 


However, tests based on such a T are often far from being optimal and hence properties like maximising power are not immediate. Consistency is, therefore, an important concern to measure the sensitivity of the test in large samples to a small departure from the null hypothesis. We provide below the notion of consistency of tests in some detail.

Suppose $X_1, X_2, .., X_N$ are iid observations from an unknown distribution G. We are interested in testing $H_0 : G \in \Omega_0$ against $H_a : G \in \Omega_a$, where $\Omega_0$ ($\Omega_a$) is the class of distributions specified by the null (alternative) hypothesis.


A sequence of tests $\{\phi_N\}$ is said to be consistent if for every $G \in \Omega_a$,

$E_G\phi_N \to 1$

as $N \to \infty$.

However, such a definition is not easy to check and hence we require some sufficient conditions.

Assume that $\phi_N$ is a right tailed test based on a statistic $T_N$. Then $\{\phi_N\}$ is consistent if
(i) $\phi_N$ is asymptotically size $\alpha$, that is, $E_G\phi_N \to \alpha$ for every $G \in \Omega_0$ and

(ii) $T_N \xrightarrow{P} \mu(G)$, with $\mu(G)$ satisfying

$\mu(G) = \mu_0$ for all $G \in \Omega_0$

$\mu(G) > \mu_0$ for all $G \in \Omega_a$.


The requirement on asymptotic size is not immediate in practice. However, if we can establish asymptotic normality of $T_N$ for all $G \in \Omega_0$, then the requirements in (i) and (ii) above are automatically satisfied. Therefore, if there exists a constant $\sigma_0 (> 0)$ such that as $N \to \infty$

$\frac{\sqrt{N}(T_N - \mu_0)}{\sigma_0} \xrightarrow{D} N(0, 1)$ for all $G \in \Omega_0$,

then $\phi_N$ is consistent. Therefore asymptotic normality is a stronger condition than those required for consistency.


However, asymptotic normality or the asymptotic size condition is not immediate and hence we provide some simpler conditions. Assume that G is indexed by a real parameter $\theta$, that is, $G(x) = G(x, \theta)$ and the testing problem can be expressed as $H_0 : \theta = \theta_0$ against $H_a : \theta > \theta_0$. Suppose a level $\alpha$ test rejects the null hypothesis if $S_N(\theta_0) > c_N(\theta_0)$, where $S_N(\theta_0)$ is so constructed that

$S_N(\theta_0) \xrightarrow{P} \Delta(\theta, \theta_0)$

with

$\Delta(\theta, \theta_0) = 0$ for $\theta = \theta_0$

$\Delta(\theta, \theta_0) > 0$ for all $\theta > \theta_0$

and $\lim_{N \to \infty} c_N(\theta_0) = 0$. Then the test based on $S_N(\theta_0)$ is consistent.

However, lack of consistency is a serious problem. We shall explain such a consequence through an example. Suppose $X_1, .., X_n$ are iid observations from a Cauchy distribution with location parameter $\theta$. Suppose we want to test $H_0 : \theta = \theta_0$ against $H_a : \theta > \theta_0$. We suggest two possible tests, one based on the sample mean $\bar{X}$ and the other based on the sample median $\tilde{X}$.

Test 1: Reject $H_0$ if $\bar{X} - \theta_0 > c$, where c is such that $P_{\theta_0}(\bar{X} - \theta_0 > c) = .05$. Since under the null hypothesis $\bar{X} - \theta_0$ has a Cauchy(0,1) distribution, we get $c = \tan(.45\pi) = 6.314$. Then the power function is $\beta_1(\theta) = P_\theta(\bar{X} - \theta_0 > c) = P(W > \theta_0 - \theta + 6.314)$, where W ~ Cauchy(0,1).

Test 2: Reject $H_0$ if $\tilde{X} - \theta_0 > c'$, where $c'$ is such that $P_{\theta_0}(\tilde{X} - \theta_0 > c') = .05$. But the distribution of $\tilde{X} - \theta_0$ is not easy to find. However, it is known that $\sqrt{n}(\tilde{X} - \theta)$ is asymptotically $N(0, \pi^2/4)$ under any $\theta$. Then considering the asymptotic distribution, we set $c' = \tau_{.05}\frac{\pi}{2\sqrt{n}}$, where $\tau_{.05} = 1.645$ is the upper 5% point of N(0,1). Then the power function becomes $\beta_2(\theta) = P_\theta(\tilde{X} - \theta > \theta_0 - \theta + c')$. Now it is easy to observe that both the tests are asymptotically size .05, but the latter is based on an asymptotically normal statistic whereas the former involves an inconsistent statistic.
For better visualisation of the effect of consistency on power, we provide a plot of both the power functions taking $\theta_0 = 0$. We compute powers for three choices of $\theta$ and plot the power for various values of the sample size (n). All these can be found in Figure 1 below.

[Figure 1: Nature of power for varying sample size (n). Panels: (a) $\theta$ = .20, (b) $\theta$ = .50, (c) $\theta$ = 1.0. Vertical axis: Power.]

We now explain the difference in the nature of the power functions for varying n. Suppose we fix $\theta_0 = 0$ and $\theta = .2$. It is easy to observe that the power for Test 1 remains very close to .05 for varying n whereas the power for Test 2 increases sharply. The same is observed for the other assumed choices of $\theta$. However, the rate of increase of the power of Test 2 is higher for higher values of $\theta$. This is expected, as the first test is based on an inconsistent estimator (i.e. $\bar{X}$), that is, it does not concentrate around the true value for large n. Consequently the power of Test 1 does not grow with n; indeed $\beta_1(\theta)$ does not involve n at all. On the other hand, $\tilde{X}$ is consistent and hence for large n approaches the true parameter, and consequently the power increases for increasing values of n.

4 Different nonparametric hypothesis testing problems 


Now we shall discuss the different types of hypotheses considered in nonparametric hypothesis testing together with their relevance. Based on the availability of data, hypothesis testing problems are either single sample, two sample or multi-sample. We, therefore, discuss hypotheses for each type of problem.


4.1 Single sample problems 


Suppose $X_1, X_2, .., X_n$ are iid observations from an unknown distribution F. F is unknown but known to be continuous. Then depending on the requirement, we have the following different hypotheses.


4.1.1 Problem of location 


Suppose $p \in (0, 1)$ is a known quantity and let $\theta(F) = \xi_p(F)$ be the quantile of order p for F, that is, $F(\theta(F)) = p$. Then the problem of location is to test $H_0 : \theta(F) = \theta_0$ against one of the alternatives $H_a : \theta(F) > \theta_0$ or $H_a : \theta(F) < \theta_0$ or $H_a : \theta(F) \ne \theta_0$ for some known $\theta_0$.

For the above hypothesis, only the continuity of F is required. However, if we consider the same hypothesis with the added assumption of symmetry of F, it becomes the problem of location under symmetry. Since the main concern in a problem of location, or location under symmetry, is the location (e.g. median), the hypothesis is an analogue of the test of a location parameter in the parametric counterpart.


4.1.2 Goodness of fit problem 


In a goodness of fit problem, the interest lies in investigating whether the sample comes from a specified distribution. Then the hypothesis of interest in a goodness of fit problem can be described as $H_0 : F(x) = F_0(x)$ for all x, where $F_0$ is a completely known DF. The alternative hypothesis is naturally $H_a : F(x) \ne F_0(x)$ for at least one x. The problem of goodness of fit also arises in parametric inference after some known distribution is fitted to the data.


4.2 Two sample problems 


Suppose $X_i, i = 1, 2, .., n$ and $Y_j, j = 1, 2, .., m$ are independent samples from unknown distributions F and G respectively. We only assume the continuity of observations from F and G. The basic hypothesis in any two sample problem is $H_0 : F(x) = G(x)$ for all x against the usual one sided or two sided hypothesis. Defining $\Gamma_0 = \{(F, G) : F(x) = G(x)\ \forall x\}$, we can express the null hypothesis as $H_0 : (F, G) \in \Gamma_0$. Depending on different specifications of $\Gamma_0$ and $\Gamma_a = \{(F, G) : F(x) \ne G(x)$ for some x$\}$, we have the following possible hypotheses.


1. General/Homogeneity Alternative: Suppose the two underlying populations differ in any manner (in location, scale or skewness). Then such an alternative hypothesis can be expressed as $\Gamma_a = \{(F, G) : F(x) \ne G(x)$ for some x$\}$. The general hypothesis is also termed the Homogeneity alternative.


2. Stochastic Alternative: A stochastic alternative is a restricted alternative, where
$\Gamma_a = \{(F, G) : G(x) \ge F(x)$ for all x with strict inequality for some x$\}$.

Actually $G(x) \ge F(x)$ for all x implies that X observations tend to be larger than Y observations or, in other words, X is stochastically larger than Y. This is a general class of alternatives.


3. Location Alternative: Suppose $G(x) = F(x - \theta)$, where $\theta \ne 0$. That is, the underlying distributions differ only in location. Then $F(x) \ge G(x)$ or $F(x) = G(x)$ or $F(x) \le G(x)$ for all x according as $\theta > 0$ or $\theta = 0$ or $\theta < 0$. Thus the null hypothesis can be restated as $H_0 : \theta = 0$ and the alternative is either $H_a : \theta > 0$ or $H_a : \theta < 0$ or $H_a : \theta \ne 0$. Clearly $\theta > 0$ implies $\Gamma_a = \{(F, G)$ : G is F shifted to the right by $\theta\}$. A one sided location alternative is a special case of a stochastic alternative, where $G(x) = F(x - \theta)$. The stochastic alternative, in general, relates to the location alternative in a less restrictive sense, because an ordering of the two DFs for all x indicates that the observations of one sample tend to be larger than those of the other, and hence corresponds to a larger location for that sample.


4. Scale Alternative: Suppose $G(x) = F(x/\sigma)$ with $\sigma > 0$, that is, the two underlying populations are assumed to differ only in scale. Since $F(x) \ne G(x) \Leftrightarrow \sigma \ne 1$, the null hypothesis reduces to $H_0 : \sigma = 1$. The alternative is either of $H_a : \sigma > 1$ or $H_a : \sigma < 1$ or $H_a : \sigma \ne 1$. Scale alternative can also be viewed as a stochastic alternative, where


[Figure 2: Effect of location on the distribution function and density. Curves of $F(x - \theta)$ and the density are shown for $\theta < 0$, $\theta = 0$ and $\theta > 0$.]


It is worthwhile to mention that tests meant for the general or stochastic alternative can also be used for location and scale alternatives, but they will be less efficient than tests developed for the specific alternative. A similar set of alternatives can be formulated for any multiple sample problem; these will be discussed later in specific situations and hence are not discussed separately here.


4.3 Paired sample problems 


Suppose $(X_i, Y_i)$, $i = 1,2,..,n$, are sample observations from an unknown bivariate distribution $F(x,y)$ with $F_X(x)$ and $F_Y(y)$ as the marginal DFs. Then the following two hypotheses are of main interest:

1. Problem of association: Here the interest lies in testing
$H_0 : F(x,y) = F_X(x)F_Y(y)$ for all $(x,y)$
against the alternative
$H_a : F(x,y) \neq F_X(x)F_Y(y)$ for some $(x,y)$.

[Figure 3: Effect of varying scale on distribution functions. Curves of $F(x/\sigma)$ for $\sigma < 1$, $\sigma = 1$ and $\sigma > 1$; panel (a) Symmetric population, panel (b) Asymmetric population.]


2. Problem of location: Suppose X represents the response before a drug is administered and Y denotes that after the application of the drug. Then naturally Y is influenced by X and hence X and Y are correlated. The natural objective in this situation is to determine whether the drug has any effect. In statistical terms, absence of an effect means that X and Y are exchangeable. Thus the null hypothesis can be expressed as
$H_0$ : X and Y are exchangeable
$\equiv H_0 : (X,Y) \stackrel{d}{=} (Y,X)$
$\equiv H_0 : F(x,y) = F(y,x)\ \forall (x,y)$.

Define $D = Y - X$. Then under the null hypothesis the distribution of D is symmetric about the origin. If X and Y differ only in location under the alternative, then $D - \Delta \stackrel{d}{=} -(D - \Delta)$, where $\Delta$ is the location difference. Naturally, under the null hypothesis, the distribution of D has median at the origin but under the alternative, the median becomes $\Delta$. Then the problem reduces to testing $H_0 : \Delta = 0$ against all alternatives. However, the median of the distribution of the difference is not always the difference of the marginal medians. If the marginal distributions and the distribution of the difference are all symmetric, then the median of the distribution of the difference and the difference of the two medians coincide (see Gibbons and Chakraborti, 2006, for details).


5 Distribution free confidence interval for quantiles 


Suppose $X_i$, $i = 1,2,..,n$, are iid observations from a continuous but unknown DF $F(x)$. The objective is to provide a confidence interval for the p-th order quantile $\xi_p$ satisfying $F(\xi_p) = p$. Since $\xi_p$ is a population quantile, it is natural to consider intervals based on the sample quantiles as a confidence interval. Thus we can start with $X_{(r)}$ (i.e. the r-th sample order statistic) and $X_{(s)}$ (i.e. the s-th sample order statistic). That is, we suggest to consider the interval $[X_{(r)}, X_{(s)}]$ with $r < s$ as a confidence interval for $\xi_p$. Now we shall show that the coverage probability of $[X_{(r)}, X_{(s)}]$ is independent of any F. Note that for any k, $X_{(k)} \leq \xi_p \Leftrightarrow Z \geq k$, where Z, the number of observations not exceeding $\xi_p$, has a binomial distribution with parameters n and success probability $F(\xi_p) = p$. Thus

$$P\{[X_{(r)}, X_{(s)}] \ni \xi_p\} = P(X_{(r)} \leq \xi_p \leq X_{(s)}) = P(Z \geq r) - P(Z \geq s) = \sum_{i=r}^{s-1} \binom{n}{i} p^i (1-p)^{n-i} = \gamma(n,r,s).$$

Thus $[X_{(r)}, X_{(s)}]$ is a confidence interval for $\xi_p$ with confidence coefficient $\gamma(n,r,s)$. Clearly, $\gamma(n,r,s)$ is independent of any F and hence $[X_{(r)}, X_{(s)}]$ gives a distribution free confidence interval for $\xi_p$. However, in practice the confidence coefficient is set at least $(1 - \alpha)$ with specified $\alpha$ and hence we choose s and r in such a way that $\sum_{i=r}^{s-1} \binom{n}{i} p^i (1-p)^{n-i} \geq 1 - \alpha$ is satisfied.

But such a choice is not unique as we are determining r and s from a single condition. However, if we want a confidence interval with smallest length, we can determine r and s such that $E(X_{(s)} - X_{(r)})$ is a minimum for a given $\alpha$. The most common approach to achieve uniqueness is to assign equal probability to each tail. That is, for a given $\alpha$, we determine r and s satisfying $\sum_{i=0}^{r-1} \binom{n}{i} p^i (1-p)^{n-i} \leq \frac{\alpha}{2}$ and $\sum_{i=s}^{n} \binom{n}{i} p^i (1-p)^{n-i} \leq \frac{\alpha}{2}$. As another alternative, $r + s$ is taken as $n + 1$ so that the ranks of the confidence limits become equidistant from the two ends of the ordered sample.
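The equal-tail choice of r and s can be sketched in a few lines of Python. This is only an illustration under the setup above: the function name `quantile_ci_ranks` is hypothetical, and it simply picks the largest r and the smallest s whose tail probabilities each stay within $\alpha/2$.

```python
from math import comb

def quantile_ci_ranks(n, p, alpha):
    """Equal-tail ranks (r, s) for a distribution free CI [X_(r), X_(s)] of the p-th quantile.

    Returns (r, s, coverage) or None when no valid pair exists for this n.
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]

    # Largest r with sum_{i=0}^{r-1} pmf[i] <= alpha/2.
    r, lower = 0, 0.0
    while r < n and lower + pmf[r] <= alpha / 2:
        lower += pmf[r]
        r += 1

    # Smallest s with sum_{i=s}^{n} pmf[i] <= alpha/2.
    s, upper = n + 1, 0.0
    while s > 1 and upper + pmf[s - 1] <= alpha / 2:
        upper += pmf[s - 1]
        s -= 1

    if r < 1 or s > n or r >= s:
        return None
    coverage = sum(pmf[r:s])  # gamma(n, r, s) = sum_{i=r}^{s-1} pmf[i]
    return r, s, coverage

print(quantile_ci_ranks(20, 0.5, 0.05))
```

For the median (p = 0.5) the two tails are mirror images, so this choice also satisfies r + s = n + 1.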




Nonparametric Inference: Module 8: 


What we provide in this module 


• A motivating example explaining the need of alternative hypothesis testing procedure
• Sign Test: Definition and Properties
• Intuitive justification of the form
• Optimality of Sign test
• Consistency of Sign test


1 Co-Ordinator: Dr. Rahul Bhattacharya, Department of Statistics, Calcutta University 


1 A motivating example 


We start with a motivating example. Consider the following data on the lifetime of an 
electric equipment: 

1046.541 1110.841 1259.690 1014.233 1156.425 1001.439 1299.962 1116.045 

1022.895 1106.415 1023.236 1093.674 1103.354 1005.930 1202.124 1001.251 

1129.546 1051.215 1043.066 1054.430. 

Suppose interest is to test the null hypothesis that the mean lifetime is 1050 hours. The usual practice is to use Student's t test for the purpose. But a natural question is "Does the normality assumption hold?" We perform some exploratory data analysis and provide below the histogram and Q-Q plot for the data.


[Figure 1: Histogram and Q-Q plot of the data. Panel (a) The Histogram, horizontal axis: Data; panel (b) Normal Q-Q plot, horizontal axis: Theoretical Quantiles.]


The histogram shows that the distribution of the data is far from symmetric. The Q-Q plot clearly reveals the non-normality of the data. Hence the t test is not appropriate. A parametric test for this data will be appropriate only if the assumed distribution is appropriate. However, the choice of an appropriate distribution is subjective and no rule of thumb is available. Thus, we need alternative procedures to test the hypothesis appropriately.


2 What is Sign Test? 


The Sign test is a nonparametric analogue of Student's t test for the mean (i.e. a location parameter) of the population. The t test is based on the assumption of normality of the underlying population, whereas the Sign test does not require this assumption. It also provides a test of location but uses a quantile of the distribution as the location parameter. Moreover, the Sign test requires only the continuity of the underlying population.


2.1 Assumptions & the hypothesis 


Suppose $X_1, X_2, .., X_n$ are iid observations from a population characterised by the DF F, where F is unknown but assumed to be continuous. Suppose $\theta(F)$ is the quantile of order p, that is $F(\theta(F)) = p$ for known p. Then in the Sign test, the objective is to test $H_0 : \theta(F) = \theta_0$ against one of the alternatives $H_a : \theta(F) > \theta_0$ or $H_a : \theta(F) < \theta_0$ or $H_a : \theta(F) \neq \theta_0$ for some known $\theta_0$. For our discussion, we choose $p = 0.5$ so that $\theta(F)$ reduces to the median.


2.2 Sign test statistic- The intuitive argument 


Note that if the observed data are consistent with $\theta(F) = \theta_0$, then one can expect about 50% of the values in the data set to lie above $\theta_0$ and about 50% below it. This suggests using the number of observations exceeding $\theta_0$ as the test statistic. Formally this implies the use of the statistic $S(\theta_0) = \sum_{i=1}^{n} I(X_i - \theta_0 > 0)$. Since S counts the number of positive signs among $X_i - \theta_0$, $i = 1,2,..,n$, the test based on S is called the Sign test.


2.3 Sign test statistic: Another look 


Assume that $\theta_0 = 0$. Then $\theta(F) > 0 \Leftrightarrow F(0) < .5$, that is $P(X_1 > 0) > .5$. Similarly, $\theta(F) < 0 \Leftrightarrow F(0) > .5$, that is $P(X_1 < 0) > .5$. Thus $\theta(F) > 0$ (or $< 0$) implies more positive (negative) observations. The following figures will make the idea clear.


[Figure 2: Location of $\theta$. Densities for $\theta < 0$, $\theta = 0$ and $\theta > 0$; panel (a) Equal scale, panel (b) Varying scale; horizontal axis: x.]


Consider the alternative $H_a : \theta > 0$. This suggests using the number of positive observations as our test statistic. Formally this implies the use of the statistic $S = \sum_{i=1}^{n} I(X_i > 0)$. Similarly, the form of the statistic for other alternatives can also be justified.


3 Distribution of S 


Assume that $P(X_i = \theta_0) = 0$ for every $i = 1,2,..,n$. Since each $I(X_i - \theta_0 > 0)$ is a Bernoulli random variable and the observations are iid, $I(X_i - \theta_0 > 0)$, $i = 1,2,..,n$, are iid Bernoulli random variables with success probability $P(X_1 > \theta_0)$. Thus S is the sum of n iid Bernoulli random variables with success probability $P(X_1 > \theta_0)$.

We see that S has a Binomial$(n, P(X_1 > \theta_0))$ distribution. Naturally the success probability $P(X_1 > \theta_0)$ depends on the underlying F. However, under the null hypothesis $F(\theta_0) = 0.5$ and hence S becomes a distribution free statistic. Therefore tests based on S are exactly nonparametric.
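The distribution free property can be checked informally by simulation. The sketch below is only an illustration: the two populations (standard normal, and an exponential shifted to have median 0), the sample size and the number of replications are arbitrary choices made here, not taken from the text. In both cases the average of S should be close to n/2.

```python
import math
import random

def sign_statistic(sample, theta0=0.0):
    """S(theta0): number of observations exceeding theta0."""
    return sum(1 for x in sample if x - theta0 > 0)

def mean_of_S(draw, n=10, reps=4000, seed=1):
    """Monte Carlo average of S over `reps` samples of size n drawn by `draw`."""
    rng = random.Random(seed)
    total = 0
    for _ in range(reps):
        total += sign_statistic([draw(rng) for _ in range(n)])
    return total / reps

# Two very different populations, both with median 0 (illustrative choices).
normal = lambda rng: rng.gauss(0.0, 1.0)
shifted_exponential = lambda rng: rng.expovariate(1.0) - math.log(2.0)

print(mean_of_S(normal), mean_of_S(shifted_exponential))
```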


4 Critical region 


Since $S \sim$ Binomial$(n, \frac{1}{2})$ under $H_0$, $E(S) = \frac{n}{2}$. However, under any $\theta(F)$, $E(S) = n(1 - F(\theta_0))$. Suppose $\theta > \theta_0$; then more than 50% of the observations are expected to exceed $\theta_0$. Thus $S(\theta_0)$ is expected to be larger under $\theta > \theta_0$ than under $\theta = \theta_0$. Therefore, larger values of $S(\theta_0)$ indicate evidence against $\theta = \theta_0$. Naturally a right tailed test based on $S(\theta_0)$ seems appropriate for testing $H_0 : \theta = \theta_0$ against $H_a : \theta > \theta_0$.

Again, less than 50% of the observations are expected to exceed $\theta_0$ under $\theta < \theta_0$. Thus $S(\theta_0)$ is expected to be smaller under $\theta < \theta_0$ than under $\theta = \theta_0$. Thus a left tailed test based on $S(\theta_0)$ seems appropriate for the alternative $H_a : \theta < \theta_0$. However, if $\theta \neq \theta_0$, then $S(\theta_0)$ is expected to be either smaller or larger than under $\theta = \theta_0$. Therefore, a two tailed test based on $S(\theta_0)$ is appropriate for the alternative $H_a : \theta \neq \theta_0$.


5 Symmetry of S 


Since the distribution of $S(\theta_0)$ is Binomial$(n, .5)$ under the null hypothesis, $S(\theta_0)$ has a distribution symmetric about $\frac{n}{2}$. We explore the implication of this symmetry. Suppose we have observed $Y_i = 2\theta_0 - X_i$ for each i, instead of $X_i$. Under $\theta = \theta_0$, the median of the distributions of both $Y_i$ and $X_i$ is $\theta_0$. Then tests for the two sided alternative based on the $Y_i$'s and the $X_i$'s are expected to give similar results. But a test applied to the Y's will give a result similar to that applied to the X's only if the statistic has a symmetric distribution. This justifies the requirement of symmetry.


5.1 Different Tests 


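Since S has a discrete distribution, tests of exact size based on it will, in general, be randomized. For the alternative $H_a : \theta > \theta_0$, a size $\alpha$ test can be expressed as $\phi = I(S > S_\alpha) + a I(S = S_\alpha)$, where the cut off $S_\alpha$ and the randomization constant $a$ are such that $E_{H_0}\phi = \alpha$.

For the alternative $H_a : \theta < \theta_0$, a size $\alpha$ test can be expressed as $\phi = I(S < S_{1-\alpha}) + a I(S = S_{1-\alpha})$, where $S_{1-\alpha}$ is such that $E_{H_0}\phi = \alpha$.

For the alternative $H_a : \theta \neq \theta_0$, since the distribution of S is symmetric about $\frac{n}{2}$ under the null hypothesis, a size $\alpha$ test can be expressed as $\phi = I(|S - \frac{n}{2}| > S_\alpha) + a I(|S - \frac{n}{2}| = S_\alpha)$ with $E_{H_0}\phi = \alpha$.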
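As an illustration of how the cut off and the randomization constant of the right tailed test may be obtained, here is a minimal sketch. The function name `randomized_right_tail` is hypothetical; it uses only the Binomial(n, 1/2) null distribution of S described above.

```python
from math import comb

def randomized_right_tail(n, alpha):
    """Cut off c and randomization constant a with P(S > c) + a * P(S = c) = alpha
    under S ~ Binomial(n, 1/2)."""
    pmf = [comb(n, i) / 2**n for i in range(n + 1)]
    c = n
    upper = 0.0  # running value of P(S > c)
    # Lower the cut off while the whole tail {S >= c} still fits within alpha.
    while c > 0 and upper + pmf[c] <= alpha:
        upper += pmf[c]
        c -= 1
    a = (alpha - upper) / pmf[c]
    size = upper + a * pmf[c]
    return c, a, size

c, a, size = randomized_right_tail(20, 0.05)
print(c, a, size)
```

The test rejects outright when S > c and rejects with probability a when S = c, so its size equals $\alpha$ exactly.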


5.2 Test based on p values 


Suppose $S_{obs}$ is the observed value of S. For the alternative $H_a : \theta > \theta_0$, the one sided p value is $P_{H_0}(S \geq S_{obs})$. We accept the null hypothesis if this p value exceeds $\alpha$. For the alternative $H_a : \theta < \theta_0$, the one sided p value is $P_{H_0}(S \leq S_{obs})$. We accept the null hypothesis if this p value exceeds $\alpha$. However, for the two sided alternative $H_a : \theta \neq \theta_0$, the two sided p value is $2\min\{P_{H_0}(S \geq S_{obs}), P_{H_0}(S \leq S_{obs})\}$. We reject the null hypothesis if this p value does not exceed $\alpha$.
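To make the p value calculation concrete, the sketch below applies the Sign test to the lifetime data of the motivating example, taking $\theta_0 = 1050$. Note that this tests the median rather than the mean; the function name `sign_test` is hypothetical, and capping the two sided p value at 1 is a convention added here.

```python
from math import comb

lifetimes = [
    1046.541, 1110.841, 1259.690, 1014.233, 1156.425, 1001.439, 1299.962, 1116.045,
    1022.895, 1106.415, 1023.236, 1093.674, 1103.354, 1005.930, 1202.124, 1001.251,
    1129.546, 1051.215, 1043.066, 1054.430,
]

def sign_test(sample, theta0):
    """Exact Sign test p values for H0: median = theta0 (zero differences are dropped)."""
    diffs = [x - theta0 for x in sample if x != theta0]
    n = len(diffs)
    s = sum(d > 0 for d in diffs)
    upper = sum(comb(n, i) for i in range(s, n + 1)) / 2**n  # P(S >= s_obs)
    lower = sum(comb(n, i) for i in range(0, s + 1)) / 2**n  # P(S <= s_obs)
    two_sided = min(1.0, 2 * min(upper, lower))               # capped at 1
    return n, s, upper, lower, two_sided

n, s, upper, lower, two_sided = sign_test(lifetimes, 1050.0)
print(n, s, upper, lower, two_sided)
```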


6 Presence of ties 


We have already assumed continuity of F so that $P(X_i = \theta_0) = 0$ for every $i = 1,2,..,n$. But in practice, we can have observations equal to $\theta_0$. Thus we get some zeros among the differences $X_i - \theta_0$. Presence of a large number of zeros can give misleading results. The usual method in this context is to ignore the zero differences. However, ties can occur theoretically if $P(X_i = \theta_0) > 0$. Suppose now we are interested in $H_0 : P(X_1 > \theta_0) = P(X_1 < \theta_0)$. Let $S^+$ ($S^-$) denote the number of positive (negative) signs in such a case. Then $S^+ + S^- \leq n$, and the conditional distribution of $S^+$ given $S^+ + S^-$ is again binomial with parameters $S^+ + S^-$ and $p = \frac{P(X_1 > \theta_0)}{P(X_1 > \theta_0) + P(X_1 < \theta_0)}$. Thus $p = \frac{1}{2}$ under the null hypothesis and therefore $S^+$ can be taken as our new statistic and tests based on it can be performed as earlier. These tests are known as conditional sign tests.


7 Optimality of Sign Test 


Consider testing $H_0 : \theta = \theta_0$ against $H_a : \theta > \theta_0$. Define $\Gamma_0 = \{F : F(\theta_0) = \frac{1}{2}\}$ and $\Gamma_a = \{F : F(\theta_0) < \frac{1}{2}\}$. Then the above hypothesis testing problem can be equivalently expressed as testing $H_0 : F \in \Gamma_0$ against $H_a : F \in \Gamma_a$. Then $H_0$ and $H_a$ are both composite. It can be shown that the UMP size $\alpha$ test for the above testing problem is nothing but the Sign test based on S. In a similar way the two sided Sign test is UMPU size $\alpha$ (see Fraser, 1957, for details).


8 Consistency of Sign test 


Consider testing $H_0 : \theta = \theta_0$ against $H_a : \theta > \theta_0$. Now the Sign test can be equivalently expressed in terms of $\frac{S}{n}$. For simplicity assume $\theta_0 = 0$. Since S has a binomial distribution, we have under any $\theta = \theta(F)$,

$$\frac{S}{n} \stackrel{P}{\rightarrow} P_\theta(X_1 > 0).$$

Being consistent with our development, we define $\mu(F) = P(X_1 > 0)$. Then $\mu(0) = \frac{1}{2}$ is the value of $\mu(F)$ under the null hypothesis and hence

$$\frac{S}{n} \stackrel{P}{\rightarrow} \mu(F) > \frac{1}{2} \quad \text{for } \theta > 0.$$

Thus the Sign test is consistent against the alternative $H_a : \theta > \theta_0$. Consistency against the other alternatives can be proved similarly.


Nonparametric Inference: Module 9: 


What we provide in this module 


• Unbiasedness of Sign Test
• Confidence interval estimation using Sign test
• Large sample test
• Sample size determination
• Applications of Sign Test




1 Unbiasedness of Sign Test


Suppose $X_1, X_2, .., X_n$ are iid observations from a population characterised by the DF F, where F is unknown but assumed to be continuous. Suppose $\theta(F)$ is the quantile of order p, that is $F(\theta(F)) = p$ for known p. Consider testing $H_0 : \theta(F) = \theta_0$ against the alternative $H_a : \theta(F) > \theta_0$. Then, with $a$ denoting the randomization constant of the size $\alpha$ test $\phi$, the power function can be expressed as

$$E_\theta\phi = P_\theta(S(\theta_0) > S_\alpha) + a P_\theta(S(\theta_0) = S_\alpha) = a P_\theta(S(\theta_0) > S_\alpha - 1) + (1 - a) P_\theta(S(\theta_0) > S_\alpha).$$

Now observe that $S(\theta_0) = \sum_{i=1}^{n} I(X_i > \theta_0)$ is expected to be larger under $\theta$ than under $\theta_0$. Thus $S(\theta_0)$ under $\theta$ is stochastically larger than that under $\theta_0$. Then we find that $E_\theta\phi \geq P_{\theta_0}(S(\theta_0) > S_\alpha) + a P_{\theta_0}(S(\theta_0) = S_\alpha)$. It is easy to observe that the RHS of the above equals $E_{\theta_0}\phi = \alpha$. Thus we find that $E_\theta\phi \geq \alpha$ for any $\theta > \theta_0$ and hence unbiasedness follows. Unbiasedness for the other tests can also be proved in a similar way.


2 Confidence interval using Sign test 


In parametric inference, we often obtain a confidence interval from the acceptance region of a test. The same technique, though difficult, can also be adopted in nonparametric procedures. Thus we can use the cut offs of the two sided Sign test to get a confidence interval for $\theta$ with confidence coefficient at least $(1 - \alpha)$. Consider the acceptance region of the non-randomized level $\alpha$ Sign test for $H_0 : \theta = \theta_0$ against $H_a : \theta \neq \theta_0$. With the already introduced notations, the acceptance region can be expressed as $c \leq S(\theta_0) \leq n - c$, where c is such that $P_{\theta_0}(c \leq S(\theta_0) \leq n - c) \geq 1 - \alpha$. We have already seen that $S(\theta) = \sum_{i=1}^{n} I(X_i > \theta)$ is non-increasing in $\theta$. Then using the properties of order statistics, we at once obtain

$$S(\theta) \leq n - c \Leftrightarrow \theta \geq X_{(c)}$$

and

$$S(\theta) \geq c \Leftrightarrow \theta < X_{(n-c+1)}.$$

Thus $c \leq S(\theta) \leq n - c$ is equivalent to $X_{(c)} \leq \theta < X_{(n-c+1)}$. Then

$$P_\theta([X_{(c)}, X_{(n-c+1)}) \ni \theta) = P_\theta(c \leq S(\theta) \leq n - c).$$

Since the distribution of $S(\theta)$ under $\theta$ is the same as that of $S(\theta_0)$ under $\theta_0$, we have

$$P_\theta(c \leq S(\theta) \leq n - c) = P_{\theta_0}(c \leq S(\theta_0) \leq n - c).$$

Since the test is of level $\alpha$, the RHS probability in the above is at least $1 - \alpha$. Thus the coverage probability of the random interval $[X_{(c)}, X_{(n-c+1)})$ is at least $1 - \alpha$. Hence $[X_{(c)}, X_{(n-c+1)})$ is a confidence interval for $\theta$ with confidence coefficient at least $1 - \alpha$.
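The construction can be sketched numerically as follows. This is an illustrative implementation only: the function name `sign_test_ci` is hypothetical, c is taken as the largest value for which the two excluded tails together stay within $\alpha$, and the data are the lifetimes from the motivating example of Module 8.

```python
from math import comb

lifetimes = [
    1046.541, 1110.841, 1259.690, 1014.233, 1156.425, 1001.439, 1299.962, 1116.045,
    1022.895, 1106.415, 1023.236, 1093.674, 1103.354, 1005.930, 1202.124, 1001.251,
    1129.546, 1051.215, 1043.066, 1054.430,
]

def sign_test_ci(sample, alpha):
    """Interval [X_(c), X_(n-c+1)) for the median obtained by inverting the two sided Sign test.

    Returns (c, lower_limit, upper_limit, coverage), or None if even c = 1 is not admissible.
    """
    n = len(sample)
    pmf = [comb(n, i) / 2**n for i in range(n + 1)]
    c, tail = 0, 0.0  # tail = P(S < c) = P(S > n - c) under Binomial(n, 1/2)
    while c < n / 2 and 2 * (tail + pmf[c]) <= alpha:
        tail += pmf[c]
        c += 1
    if c == 0:
        return None
    xs = sorted(sample)
    coverage = 1 - 2 * tail          # P(c <= S <= n - c)
    return c, xs[c - 1], xs[n - c], coverage  # X_(c) and X_(n-c+1)

print(sign_test_ci(lifetimes, 0.05))
```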


3 Large sample test 


Under the null hypothesis, $S \sim$ Binomial$(n, \frac{1}{2})$. Now by the De Moivre-Laplace limit theorem, $S^* = \frac{S - n/2}{\sqrt{n/4}}$ is asymptotically $N(0,1)$. Therefore, different tests can be performed in large samples using $S^*$. Consider testing $H_0 : \theta = \theta_0$ against $H_a : \theta > \theta_0$. Then the corresponding large sample test is non-randomized and rejects the null hypothesis if the observed value of $S^*$ exceeds $\tau_\alpha$, the upper $\alpha$ point of the $N(0,1)$ distribution. Similarly, the large sample tests for the other hypotheses can also be constructed.
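A minimal sketch of the right tailed large sample test is given below. The function name `large_sample_sign_test` and the illustrative values s = 60, n = 100 are assumptions made here; the normal quantile is taken from Python's `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist

def large_sample_sign_test(s, n, alpha=0.05):
    """Right tailed large sample Sign test based on S* = (S - n/2) / sqrt(n/4)."""
    s_star = (s - n / 2) / sqrt(n / 4)
    tau_alpha = NormalDist().inv_cdf(1 - alpha)   # upper alpha point of N(0,1)
    p_value = 1 - NormalDist().cdf(s_star)        # approximate one sided p value
    return s_star, tau_alpha, s_star > tau_alpha, p_value

# Hypothetical illustration: 60 of 100 observations exceed theta_0.
print(large_sample_sign_test(60, 100))
```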


4 Sign test for quantiles 


Suppose $p \neq 0.5$, that is, $\theta(F)$ is the quantile of order p. The Sign test can still be used to test $H_0 : \theta(F) = \theta_0$ against the usual alternatives. Properties like consistency and unbiasedness will be retained. However, the distribution of $S(\theta_0)$ in this case will be Binomial$(n, 1 - p)$ under the null hypothesis. Naturally the statistic for the large sample test will be $S^* = \frac{S - n(1-p)}{\sqrt{np(1-p)}}$, which is asymptotically $N(0,1)$ under the null hypothesis.


5 Paired Sample Sign Test 


Suppose $(X_i, Y_i)$, $i = 1,2,..,n$, are sample observations from an unknown continuous bivariate distribution $F(x,y)$. Suppose the median of the distribution of the difference $Z = X - Y$ is $\theta_Z$. The objective is to test $H_0 : \theta_Z = \theta_0$ against the usual one sided and two sided alternatives. Naturally, $\theta_Z = \theta_0$ indicates that the X observations tend to be $\theta_0$ units larger than the corresponding Y observations. Thus statements about $\theta_Z$ give information about the relative locations of the marginal distributions. Then the appropriate procedure is simply the Sign test based on the new set of observations $Z_i = X_i - Y_i$, $i = 1,2,..,n$.


6 Sample size determination 


To perform a Sign test, we need a random sample. Suppose the objective is to detect a shift in the median. Consider testing $H_0 : \theta = 0$ against $H_a : \theta = \theta_1 (> 0)$ for specified $\theta_1$. Now the test can be based on a small sample size or a large sample size. But often the observations are subject to some cost and time constraint. Therefore, the experimenter needs to choose a sample size which will be sufficient to reach a decision. The usual technique is to determine n in such a way that the test has size $\alpha$ and power $1 - \beta$ at the alternative, where $\alpha$ and $\beta$ are specified in advance.

Consider the non-randomized level $\alpha$ Sign test for $H_a : \theta = \theta_1$, which rejects the null hypothesis if $S \geq k$, where k is such that $P_{H_0}(S \geq k) \leq \alpha$. Since S has a binomial distribution with parameters n and success probability 0.5 under the null hypothesis, the above condition reduces to

$$\sum_{i=k}^{n} \binom{n}{i} (0.5)^n \leq \alpha.$$

Suppose $p = P(X_1 > 0 \mid \theta_1)$. Then the power of the above test becomes $\sum_{i=k}^{n} \binom{n}{i} p^i (1-p)^{n-i}$. Thus we need to determine the sample size such that the conditions

$$\sum_{i=k}^{n} \binom{n}{i} (0.5)^n \leq \alpha$$

and

$$\sum_{i=k}^{n} \binom{n}{i} p^i (1-p)^{n-i} \geq 1 - \beta$$

are satisfied.


6.1 Calculation for various F 


Suppose $\alpha$ is set at 6% and power at 80%. Then we get n = 15 and k = 11. Now we take $\theta_1 = .8$ and consider F as Normal, Cauchy and Logistic. For the normal distribution, we get power .801 but for the Cauchy distribution it is .57 and for the Logistic distribution it is .47. Thus for a normal population, the required sample size is 15. However, we need more samples to achieve 80% power for the other distributions. A similar exercise with $\alpha = .01$ gives n = 25 and k = 18. Then for the normal distribution, we get power .851 but for the Cauchy distribution it is .585 and for the Logistic distribution it is .48, which supports the need of further observations.
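The size and power of the test "reject if $S \geq k$" can be evaluated directly from the binomial sums. The sketch below is a hedged illustration: it assumes standard Normal, Cauchy and Logistic forms with unit scale and location $\theta_1$, so that $p = P(X_1 > 0 \mid \theta_1)$ has the closed forms used in `prob_positive` (a hypothetical helper). The resulting values depend on these scale conventions and need not reproduce every figure quoted above.

```python
from math import atan, comb, exp, pi
from statistics import NormalDist

def upper_tail(n, k, p):
    """P(S >= k) for S ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def prob_positive(theta1):
    """p = P(X_1 > 0) when the median is theta1 and the scale is 1 (assumed forms)."""
    return {
        "normal": NormalDist().cdf(theta1),
        "cauchy": 0.5 + atan(theta1) / pi,
        "logistic": 1.0 / (1.0 + exp(-theta1)),
    }

n, k, theta1 = 15, 11, 0.8
size = upper_tail(n, k, 0.5)
power = {name: upper_tail(n, k, p) for name, p in prob_positive(theta1).items()}
print(size, power)
```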


6.2 Approximate sample size 


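The determination of the exact sample size is not easy in practice. We can use the large sample approximations to get a simple formula for the sample size. Consider testing $H_0 : \theta = 0$ against $H_1 : \theta = \theta_1 (> 0)$ for specified $\theta_1$. The large sample test rejects the null hypothesis if $\frac{S}{n} > c$. Then the size and power requirements are

$$P_{H_0}\left(\frac{S}{n} > c\right) = \alpha$$

and

$$P_{H_1}\left(\frac{S}{n} > c\right) = 1 - \beta.$$

If $\sigma_i^2$ is the asymptotic variance of $\sqrt{n}\,\frac{S}{n}$ under $H_i$, $i = 0,1$, and $p = P(X_1 > 0 \mid \theta_1)$, then the above requirements can be equivalently expressed as

$$P_{H_0}\left(\sqrt{n}\,\frac{S/n - .5}{\sigma_0} > \sqrt{n}\,\frac{c - .5}{\sigma_0}\right) = \alpha$$

and

$$P_{H_1}\left(\sqrt{n}\,\frac{S/n - p}{\sigma_1} > \sqrt{n}\,\frac{c - p}{\sigma_1}\right) = 1 - \beta.$$

Then, with $\tau_\alpha$ denoting the upper $\alpha$ point of the $N(0,1)$ distribution, we get the approximate relations $c - .5 = \frac{\sigma_0 \tau_\alpha}{\sqrt{n}}$ and $c - p = -\frac{\sigma_1 \tau_\beta}{\sqrt{n}}$ and hence

$$n = \frac{(\sigma_1 \tau_\beta + \sigma_0 \tau_\alpha)^2}{(p - .5)^2},$$

where $\sigma_0^2 = \frac{1}{4}$ and $\sigma_1^2 = p(1 - p)$. The same formula is valid for testing $H_0 : \theta = 0$ against $H_1 : \theta = \theta_1 (< 0)$ for specified $\theta_1$. For the two sided alternative, we need to replace $\tau_\alpha$ by $\tau_{\alpha/2}$.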
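The approximate formula $n = (\sigma_1\tau_\beta + \sigma_0\tau_\alpha)^2 / (p - .5)^2$ is easy to evaluate. The following sketch assumes, as an illustration, standard Normal, Cauchy and Logistic populations with unit scale and shift $\theta_1 = 0.8$; the function name `approx_sample_size` is hypothetical.

```python
from math import atan, exp, pi, sqrt
from statistics import NormalDist

def approx_sample_size(p, alpha=0.05, beta=0.2):
    """Large sample approximation to n for the one sided Sign test, p = P(X_1 > 0 | theta_1)."""
    tau_alpha = NormalDist().inv_cdf(1 - alpha)  # upper alpha point of N(0,1)
    tau_beta = NormalDist().inv_cdf(1 - beta)    # upper beta point of N(0,1)
    sigma0 = 0.5                                  # sqrt(1/4)
    sigma1 = sqrt(p * (1 - p))
    return (sigma1 * tau_beta + sigma0 * tau_alpha) ** 2 / (p - 0.5) ** 2

theta1 = 0.8
p_by_family = {
    "normal": NormalDist().cdf(theta1),
    "cauchy": 0.5 + atan(theta1) / pi,
    "logistic": 1.0 / (1.0 + exp(-theta1)),
}
sizes = {name: approx_sample_size(p) for name, p in p_by_family.items()}
print(sizes)  # round each value up to the next integer in practice
```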


6.3 Comparison of approximate sample size 


For the purpose of comparison, we consider three distributions, namely, Normal, Cauchy and Logistic. The median for each distribution is taken as 0 and the scale parameter as unity. Then we consider testing $H_0 : \theta = 0$ against $H_1 : \theta > 0$. For various choices of $\theta (> 0)$, we have computed the sample size by the derived formula with $\alpha = .05$ and $\beta = .2$. The approximate sample size is plotted in Figure 1.


6.4 Observations 


The plot depicts the same fact as observed in the exact numerical study. The Normal distribution requires the lowest number of observations to reach the desired power level among the candidates. The Logistic distribution requires the highest number of observations to reach 80% power. In addition, as we increase $\theta$, the required sample size decreases for each candidate. This is a consequence of the power function being increasing in $\theta$.


7 Application of Sign Test 


7.1 Test of Trend: Cox-Stuart test 


For data coming from some measurement processes, it is often desirable to check the presence 


of trend. That is, we are interested in knowing whether the observations depend on time. 


[Figure 1: Approximate Sample size for different F. Vertical axis: Sample Size; one of the curves is labelled Logistic.]


Suppose $X_i$, $i = 1,2,..,n$, are observations from a continuous population F, which are iid in the absence of trend. The hypotheses for such tests are expressed as $H_0$ : Absence of trend in data against $H_1$ : Presence of upward trend in data.

Assume that the observations are sequentially observed and $X_1, X_2, .., X_n$ are ordered as they are observed. Also assume that n is even and define $c = \frac{n}{2}$ (if n is odd, the middlemost observation, that is the $\frac{n+1}{2}$-th observation, is removed). Then the whole set of observations is grouped into $X_i$, $i = 1,2,..,c$ and $X_i$, $i = c+1,c+2,..,n$. We club the observations into pairs $(X_i, X_{i+c})$, $i = 1,2,..,c$. If an upward trend is present, the event $X_{i+c} > X_i$ is more probable than the event $X_{i+c} < X_i$ for every i. If $p = P(X_{i+c} > X_i)$, then the testing problem can be expressed as testing $H_0 : p = \frac{1}{2}$ against $H_1 : p > \frac{1}{2}$. Thus the problem can be looked upon as that for the Sign test.

Then the test statistic T is the number of pairs $(X_i, X_{i+c})$, $i = 1,2,..,c$, for which $X_i < X_{i+c}$. That is, $T = \sum_{i=1}^{c} I(X_i < X_{i+c})$. Thus the test statistic is nothing but the Sign test statistic based on $X_{i+c} - X_i$, $i = 1,2,..,c$. Naturally higher values of T indicate presence of an upward trend. Under the null hypothesis T has a binomial distribution with parameters c and success probability $\frac{1}{2}$. Then, as usual, depending on the given level $\alpha$, the test can be constructed.
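The Cox-Stuart procedure for an upward trend can be sketched as below. The function name `cox_stuart_upward` is hypothetical; the sketch assumes no tied pairs (in line with the continuity assumption) and reports the exact one sided p value $P(T \geq T_{obs})$ under Binomial(c, 1/2).

```python
from math import comb

def cox_stuart_upward(xs):
    """Cox-Stuart test for upward trend: returns (c, T, one sided p value)."""
    n = len(xs)
    c = n // 2
    first = xs[:c]
    second = xs[n - c:]  # when n is odd this skips the middlemost observation
    t = sum(a < b for a, b in zip(first, second))  # number of pairs with X_i < X_{i+c}
    p_value = sum(comb(c, i) for i in range(t, c + 1)) / 2**c
    return c, t, p_value

# A strictly increasing toy sequence (illustrative data, not from the text).
print(cox_stuart_upward([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))
```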


7.2 Test of correlation 


The Cox-Stuart test as discussed above can also be used to test for possible correlation. Suppose patients are given two drugs, one after another. Since the drugs are applied on the same patient, responses are correlated. Assume that a higher response indicates a favourable condition. Also assume that the paired response has a continuous bivariate distribution. Then the hypotheses in such a case can be expressed as $H_0$ : Absence of positive correlation against $H_1$ : Presence of positive correlation.

For details, assume that the response of the i-th patient for the first drug (second drug) is $X_i$ ($Y_i$), $i = 1,2,..,n$. Order the pairs according to the increasing values of the X observations. For example, if three pairs of observations are (5,3), (3,2) and (7,8), then the ordered pairs, ordered according to the magnitude of the first elements, are (3,2), (5,3) and (7,8). Then testing existence of positive (negative) correlation is equivalent to testing the presence of an upward (downward) trend in the Y observations so ordered. Thus the test becomes the same as the Cox-Stuart test of trend on the ordered Y observations.


8 Why use a Sign test? 


The full information contained in the observations is not used in the Sign test. Thus the Sign test is less powerful than the t test when the t test is valid, and one may be tempted to use the t test. But the t test is based on normality. If the underlying distribution is non-normal, an optimal test rarely exists. In addition, the assumption about the underlying distribution is often instrumental. Thus the Sign test is a safe option when there is any doubt about the normality of the underlying population, though at some sacrifice in power.


Nonparametric Inference: Module 10: 


What we provide in this module 


• Wilcoxon Signed rank test
• Properties
• Exact null distribution
• Form of critical region
• Modification in the presence of ties




1 Wilcoxon Signed Rank Test 


This is another nonparametric alternative to Student's t test, with the additional assumption of symmetry. The test was developed by Frank Wilcoxon (1945) and popularized by Sidney Siegel (1956). The procedure utilizes the signed ranks of the observations and provides a distribution free test for location.


1.1 Assumptions & the hypothesis


Suppose $X_1, X_2, .., X_n$ are iid observations from a symmetric location family of distributions, so that each $X_i$ has DF $F(x - \theta(F))$, where $\theta = \theta(F)$ is a location parameter. F is assumed to be continuous and symmetric, that is, $F(x) + F(-x) = 1$ for all x. Under symmetry $\theta$ is, therefore, the median of the distribution. Then our objective is to provide a test for $H_0 : \theta(F) = \theta_0$ against the usual one or two sided alternatives. However, without any loss of generality, we can take $\theta_0 = 0$.


1.2 Signed Rank 


The continuity assumption ensures that $P(X_i = 0) = 0$ for all i and that the observations are distinct with probability one. Suppose the observations are arranged in order of absolute value and ranked accordingly. Suppose $R_i$ is the rank of $|X_i|$ among $\{|X_1|, |X_2|, .., |X_n|\}$. Then the signed rank of an observation is the rank of its absolute value multiplied by the sign of the original observation. If $Z_i = I(X_i > 0)$, then the i-th observation contributes $Z_iR_i$, $i = 1,2,..,n$, to the statistic $T = \sum_{i=1}^{n} Z_iR_i$, the sum of the ranks of the positive observations. We shall explain using the following example. Consider the following data with n = 10.

X_i         |  7 |  9 | 13 | 14 | 16 | 23 | -11 |  2 | -6 |  4
Signed rank |  4 |  5 |  7 |  8 |  9 | 10 |  -6 |  1 | -3 |  2

Then the sum of the signed ranks is 37. The sum of the ranks of the positive observations is $T = \sum_{i=1}^{n} Z_iR_i = 46$; the two are related through (sum of signed ranks) $= 2T - n(n+1)/2 = 92 - 55 = 37$, so tests based on either are equivalent.
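The signed ranks in the example can be reproduced with a short sketch. It takes the seventh and ninth observations as -11 and -6, as their negative signed ranks in the table indicate, and assumes no zeros or ties; the function name `signed_ranks` is hypothetical. It reports both the sum of all signed ranks and the sum of the ranks of the positive observations, $\sum Z_iR_i$ with $Z_i = I(X_i > 0)$.

```python
def signed_ranks(xs):
    """Rank |x| in increasing order and attach the sign of x (no ties or zeros assumed)."""
    order = sorted(range(len(xs)), key=lambda i: abs(xs[i]))
    ranks = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return [r if x > 0 else -r for x, r in zip(xs, ranks)]

data = [7, 9, 13, 14, 16, 23, -11, 2, -6, 4]
sr = signed_ranks(data)
signed_sum = sum(sr)                          # sum of all signed ranks
positive_sum = sum(r for r in sr if r > 0)    # sum of Z_i * R_i with Z_i = I(X_i > 0)
print(sr, signed_sum, positive_sum)
```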


1.3 T as a test statistic


Note that $X_i$ is expected to be larger under $\theta > 0$ than under $\theta = 0$. A large (small) value of T implies that most of the large deviations from 0 are positive (negative). Therefore, a large (small) T is an indicator of positive (negative) $\theta$. It then seems reasonable to reject the null hypothesis against $H_a : \theta > 0$ ($H_a : \theta < 0$) if T tends to be too large (too small). Similarly, both too large and too small values of T indicate possible rejection of the null hypothesis against $H_a : \theta \neq 0$.


2 T is distribution free! 


Now we shall show that the distribution of T does not depend on any F under the null hypothesis. Before we proceed further, we introduce the concept of antirank or inverse rank. If $R = (R_1, R_2, .., R_n)$ is a rank vector, the antirank vector is $D = (D_1, D_2, .., D_n)$, provided $R \circ D = (R_{D_1}, R_{D_2}, .., R_{D_n}) = (1,2,..,n)$.

Consider an example with n = 5. Suppose R = (3,2,4,1,5); then

$$R \circ D = (R_{D_1}, R_{D_2}, .., R_{D_5}) = (1,2,3,4,5)$$

only when D = (4,2,1,3,5). Thus D is the antirank of R.
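The antirank is simply the inverse permutation of the rank vector, which the following sketch computes for the example above (the function name `antirank` is a hypothetical helper).

```python
def antirank(ranks):
    """Antirank vector D of a rank vector R: D_j is the position i at which R_i = j."""
    d = [0] * len(ranks)
    for position, r in enumerate(ranks, start=1):
        d[r - 1] = position
    return d

R = [3, 2, 4, 1, 5]
D = antirank(R)
composed = [R[d - 1] for d in D]  # (R_{D_1}, ..., R_{D_n})
print(D, composed)
```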
Since T is the sum of a few numbers from the set of the first n integers, we can arrange the ranks in such a way that the i-th rank comes in the i-th position. If D is the antirank vector corresponding to $R^+$, the vector of ranks of the absolute values, we can express T as $T = \sum_{i=1}^{n} iZ_{D_i}$.

Assume that $X_i \sim F(x)$, where $F(x) + F(-x) = 1$ for all x. Naturally F has median at the origin. Then for every $x \geq 0$, it is easy to observe that

$$P(Z_i = 1, |X_i| \leq x) = \frac{1}{2}(2F(x) - 1) = P(Z_i = 1)P(|X_i| \leq x),$$

so that $Z_i$ and $|X_i|$ are independent. Since $Z_i$ and $|X_j|$ are also independent for any $i \neq j$, the desired independence follows. Now, being a function of $|X_i|$, $i = 1,2,..,n$, $R^+$ is independent of $Z_i$, $i = 1,2,..,n$. Thus $Z_i$, $i = 1,2,..,n$ and $D_i$, $i = 1,2,..,n$ are independently distributed.

Here D is a random permutation over $\mathcal{P}$, the set of all n! permutations of (1,2,..,n). Again $Z_i \sim$ Bernoulli$(\frac{1}{2})$ for every i. Thus

$$P(Z_{D_i} = z_i\ \forall i) = \sum_{d \in \mathcal{P}} P(Z_{d_i} = z_i\ \forall i \mid D = d)P(D = d) = \sum_{d \in \mathcal{P}} P(Z_{d_i} = z_i\ \forall i)P(D = d) = \sum_{d \in \mathcal{P}} \left(\frac{1}{2}\right)^n P(D = d) = \left(\frac{1}{2}\right)^n.$$

Thus $Z_{D_i}$, $i = 1,2,..,n$, are iid random variables with Bernoulli$(\frac{1}{2})$ distribution.

Hence T is the weighted sum of the iid random variables $Z_{D_i}$, $i = 1,2,..,n$. Under symmetry about the origin, $Z_{D_i}$, $i = 1,2,..,n$, are iid Bernoulli$(\frac{1}{2})$ variables. Thus under $H_0 : \theta = 0$, the distribution of $Z_{D_i}$, $i = 1,2,..,n$, and hence the distribution of T is independent of any F. Thus T is exactly distribution free under the null hypothesis. Therefore tests based on T are exactly nonparametric.


3 Exact distribution of T 


Consider the representation $T = \sum_{i=1}^{n} iZ_i$, where $Z_i$ are iid Bernoulli$(\frac{1}{2})$. Then T takes the values from 0 to $\frac{n(n+1)}{2}$ and under the symmetry of the underlying population

$$P(T = t) = \frac{b(t,n)}{2^n}, \quad t = 0,1,2,..,\frac{n(n+1)}{2},$$

where $b(t,n) = \#\{(z_1, z_2, .., z_n) : \sum_{i=1}^{n} iz_i = t\}$ and $z_i$ are either 0 or 1. Although no closed form expression exists for the distribution of T, we can compute the distribution for smaller values of n using the above definition.


We can calculate $b(t,n)$ using the recursion $b(t,n) = b(t,n-1) + b(t-n,n-1)$, with $b(t,n) = 0$ for $t < 0$ or $t > \frac{n(n+1)}{2}$ and $b(0,n) = b(1,n) = 1$ for all n. The recursion relation follows from the fact that

$$b(t,n) = \#\{(z_1, z_2, .., z_{n-1}, 1) : \sum_{i=1}^{n-1} iz_i = t - n\} + \#\{(z_1, z_2, .., z_{n-1}, 0) : \sum_{i=1}^{n-1} iz_i = t\}.$$

We give the calculation of the distribution for small values of n.

Suppose n = 2; then the possible values of T are 0, 1, 2 and 3. Then $b(0,2) = 1 = b(1,2)$. Now using the recursion relation, $b(2,2) = b(2,1) + b(0,1) = 0 + 1 = 1$. Again applying the recursion, we get $b(3,2) = b(3,1) + b(1,1) = 0 + 1 = 1$. Then we have the following distribution:

t        |  0  |  1  |  2  |  3
P(T = t) | 1/4 | 1/4 | 1/4 | 1/4

Next assume n = 3; then the possible values of T are 0,1,2,..,6. Then $b(0,3) = 1 = b(1,3)$. Now using the recursion relation and the values of $b(\cdot,2)$, we get $b(2,3) = b(2,2) + b(-1,2) = 1 + 0 = 1$, $b(3,3) = b(3,2) + b(0,2) = 1 + 1 = 2$, $b(4,3) = b(4,2) + b(1,2) = 0 + 1 = 1$, $b(5,3) = b(5,2) + b(2,2) = 0 + 1 = 1$ and $b(6,3) = b(6,2) + b(3,2) = 0 + 1 = 1$. Then we have the following distribution:

t        |  0  |  1  |  2  |  3  |  4  |  5  |  6
P(T = t) | 1/8 | 1/8 | 1/8 | 2/8 | 1/8 | 1/8 | 1/8
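The recursion for $b(t,n)$ translates directly into code. The sketch below is illustrative: it uses memoization, and as its base case it takes $b(0,0) = 1$ (the empty sum), which reproduces the boundary values used in the text; the function names are hypothetical.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def b(t, n):
    """Number of 0/1 vectors (z_1, ..., z_n) with sum of i * z_i equal to t."""
    if t < 0 or t > n * (n + 1) // 2:
        return 0
    if n == 0:
        return 1  # only t = 0 can reach this point: the empty sum
    return b(t, n - 1) + b(t - n, n - 1)

def null_counts(n):
    """b(t, n) for t = 0, 1, ..., n(n+1)/2; dividing by 2**n gives P(T = t)."""
    return [b(t, n) for t in range(n * (n + 1) // 2 + 1)]

print(null_counts(2))
print(null_counts(3))
```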


3.1 Symmetry of the distribution of T 


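If we look at the distribution for n = 3, we can observe symmetry about $n(n+1)/4 = 3$. Define $\mu_0 = n(n+1)/4$; then

$$P(T = t + \mu_0) = P\left(\sum_{i=1}^{n} iZ_i = t + \mu_0\right) \quad (\text{as } Z_i \sim \text{Bernoulli}(\tfrac{1}{2}))$$

$$= P\left(\sum_{i=1}^{n} iZ_i = \mu_0 - t\right) \quad (\text{as } Z_i \stackrel{d}{=} 1 - Z_i \text{ for all } i \text{ and } \sum_{i=1}^{n} i(1 - Z_i) = 2\mu_0 - \sum_{i=1}^{n} iZ_i)$$

$$= P(T = \mu_0 - t).$$

This holds for every $t \in \{0,1,2,..,n(n+1)/2\}$. Thus the distribution of T is symmetric about $\mu_0 = n(n+1)/4$.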


3.2 Summary measures of the distribution of T 


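Since the distribution of T is symmetric about $\mu_0 = n(n+1)/4$, we get $E(T) = n(n+1)/4$. To find the variance, we use the representation $T = \sum_{i=1}^{n} iZ_i$, where $Z_i$ are iid Bernoulli$(\frac{1}{2})$ variables. Then

$$Var(T) = Var\left(\sum_{i=1}^{n} iZ_i\right) = \sum_{i=1}^{n} i^2 Var(Z_i).$$

Since $Var(Z_i) = \frac{1}{4}$ and $\sum_{i=1}^{n} i^2 = \frac{n(n+1)(2n+1)}{6}$, we get $Var(T) = \frac{n(n+1)(2n+1)}{24}$.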


4 Different Tests 


T has a discrete distribution and hence tests of exact size based on it will, in general, be randomized. We list below the tests for different alternatives. For the alternative $H_a : \theta > 0$, a size $\alpha$ test can be expressed as

$$\phi = I(T > T_\alpha) + a I(T = T_\alpha),$$

where $T_\alpha$ and the randomization constant $a$ are such that $E_{H_0}\phi = \alpha$.

For the alternative $H_a : \theta < 0$, a size $\alpha$ test can be expressed as

$$\phi = I(T < T_{1-\alpha}) + a I(T = T_{1-\alpha}),$$

where $T_{1-\alpha}$ is such that $E_{H_0}\phi = \alpha$.

However, for the alternative $H_a : \theta \neq 0$, we note that the distribution of T is symmetric about $\mu_0 = \frac{n(n+1)}{4}$ under the null hypothesis. Hence a size $\alpha$ test can be expressed as

$$\phi = I(|T - \mu_0| > T_\alpha) + a I(|T - \mu_0| = T_\alpha)$$

with $E_{H_0}\phi = \alpha$.


4.1 p values 


Suppose $T_{obs}$ is the observed value of $T$. Then for the alternative $H_a : \theta > 0$, one can report the one sided p value $P_{H_0}(T \ge T_{obs})$. We accept the null hypothesis if this p value exceeds $\alpha$. For the alternative $H_a : \theta < 0$, the corresponding one sided p value is $P_{H_0}(T \le T_{obs})$ and we accept the null hypothesis if this p value exceeds $\alpha$. For the alternative $H_a : \theta \neq 0$, the two sided p value is $2\min\{P_{H_0}(T \ge T_{obs}), P_{H_0}(T \le T_{obs})\}$, and we reject the null hypothesis if this p value does not exceed $\alpha$.
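The p values can be computed directly from the exact null distribution. The sketch below is illustrative Python; it takes the tail probabilities as inclusive ($\ge$ and $\le$) and caps the two sided value at 1, which is an assumption added here rather than something stated in the module.

```python
from fractions import Fraction


def signed_rank_pmf(n):
    """Exact null pmf of T: pmf[t] = P(T = t)."""
    counts = [1]
    for k in range(1, n + 1):
        new = counts + [0] * k
        for t, c in enumerate(counts):
            new[t + k] += c
        counts = new
    return [Fraction(c, 2 ** n) for c in counts]


def p_values(t_obs, n):
    """Return (upper, lower, two_sided) p values for an observed T."""
    pmf = signed_rank_pmf(n)
    upper = sum(pmf[t_obs:])          # P(T >= t_obs), for H_a: theta > 0
    lower = sum(pmf[:t_obs + 1])      # P(T <= t_obs), for H_a: theta < 0
    two_sided = min(1, 2 * min(upper, lower))  # capped at 1 (assumption)
    return upper, lower, two_sided


print(p_values(6, 3))
```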


4.2 Presence of ties 


The presence of zero and tied observations undermines the validity of the test. Pratt (1959) recommended a modification of $T$ in such a situation. He suggested retaining the zeros in the ranking and using average ranks for the tied observations.

To be specific, suppose $0 < u_1 < u_2 < \ldots < u_m$ are the distinct non-zero absolute magnitudes of the data $X_1, X_2, \ldots, X_n$. Suppose $f_0$ is the frequency of zeroes in the data, $f_j^+$ ($f_j^-$) is the frequency of positive (negative) observations with absolute value $u_j$, and $f_j = f_j^+ + f_j^-$ is the total frequency of $u_j$ in the data. Then Pratt suggested using the statistic $T^* = \sum_{j=1}^{m} w_j f_j^+$, where $w_j = f_0 + f_1 + \ldots + f_{j-1} + (f_j + 1)/2$ is the average rank of the observations with absolute value $u_j$. However, only large sample tests are suggested, based on the asymptotic normality of the standardised statistic.
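The weights $w_j$ are easy to compute by accumulating the frequencies. The following Python sketch follows the formula for $T^*$ given above; the function name `pratt_statistic` and the sample data are hypothetical and serve only as an illustration.

```python
from collections import Counter
from fractions import Fraction


def pratt_statistic(data):
    """T* = sum_j w_j f_j^+ with w_j = f_0 + f_1 + ... + f_{j-1} + (f_j + 1)/2."""
    f0 = sum(1 for x in data if x == 0)
    total = Counter(abs(x) for x in data if x != 0)   # f_j
    positive = Counter(x for x in data if x > 0)      # f_j^+
    t_star = Fraction(0)
    below = f0                                        # f_0 + ... + f_{j-1}
    for u in sorted(total):
        w = below + Fraction(total[u] + 1, 2)         # average rank of the tie group
        t_star += w * positive[u]
        below += total[u]
    return t_star


# Hypothetical data with one zero and two tie groups.
print(pratt_statistic([0, 1, -1, 2, 3, 3, -4]))
```

Without zeros or ties every $f_j$ equals 1, so $w_j = j$ and the function returns the usual $T$.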


Nonparametric Inference: Module 11: 


What we provide in this module 


• Representation of T in terms of positive Walsh averages
• Consistency of Signed rank test
• Unbiasedness of Signed rank test
• Confidence interval using the cut offs
• Asymptotic normality & large sample tests
• Test for symmetry
• Sample size determination
• Comparing Signed rank test and Sign test


1 Co-Ordinator: Dr. Rahul Bhattacharya, Department of Statistics, Calcutta University 


1 Representation in terms of Walsh averages 


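Suppose $X_1, X_2, \ldots, X_n$ are iid observations from a symmetric location family of distributions $F_\theta$. Then $F_\theta(x) = F(x - \theta)$, where $\theta = \theta(F)$ is a location parameter. $F$ is assumed to be continuous and symmetric, that is, $F(x) + F(-x) = 1$ for all $x$. Under symmetry $\theta$ is, therefore, the median of $F_\theta$. Define the Walsh averages $W_{ij} = \frac{X_i + X_j}{2}$, $1 \le i \le j \le n$. Then $T$ can be expressed as the number of positive Walsh averages, i.e. $T = \sum_{i \le j} I(W_{ij} > 0)$.

Let $X_{(1)} < X_{(2)} < \ldots < X_{(n)}$ denote the ordered observations. Observe that for $j > i$, $X_{(i)} < X_{(j)}$ and hence $I(X_{(i)} + X_{(j)} > 0) = 1$ if $X_{(i)} > 0$. If $X_{(i)} < 0$, then $X_{(i)} + X_{(j)} > 0$ implies that $|X_{(i)}| < X_{(j)}$, with $X_{(j)} > 0$. Thus $I(X_{(i)} + X_{(j)} > 0) = 1$ iff $|X_{(i)}| < X_{(j)}$ and $X_{(j)} > 0$. For $i = j$, $I(X_{(i)} + X_{(j)} > 0) = I(X_{(j)} > 0)$, which equals 1 if $X_{(j)} > 0$.

Hence, for $X_{(j)} > 0$, in $\sum_{i=1}^{j} I(X_{(i)} + X_{(j)} > 0)$ only those $i$ contribute for which $|X_{(i)}| \le X_{(j)}$. Thus $\sum_{i=1}^{j} I(X_{(i)} + X_{(j)} > 0)$ is the rank of $|X_{(j)}|$ among the absolute values, say $R_j^+$, for $X_{(j)} > 0$; that is, $R_j^+ = \sum_{i=1}^{j} I(X_{(i)} + X_{(j)} > 0)$ for $X_{(j)} > 0$. Since the sum is zero when $X_{(j)} < 0$, we can equivalently write $I(X_{(j)} > 0) R_j^+ = \sum_{i=1}^{j} I(X_{(i)} + X_{(j)} > 0)$. Summing both sides over $j = 1$ to $n$, the desired expression follows, as $\sum_{j=1}^{n} I(X_{(j)} > 0) R_j^+$ is nothing but $\sum_{j=1}^{n} Z_j R_j^+$.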
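The identity between the signed rank statistic and the number of positive Walsh averages can be verified on data. The following Python sketch assumes continuous data (no zeros and no ties in absolute value); the function names and the samples are illustrative.

```python
def signed_rank_T(x):
    """Sum of the ranks of |x_i| over the positive observations."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i]))
    return sum(rank for rank, i in enumerate(order, start=1) if x[i] > 0)


def positive_walsh_count(x):
    """Number of Walsh averages (x_i + x_j)/2, i <= j, that are positive."""
    n = len(x)
    return sum(1 for i in range(n) for j in range(i, n) if (x[i] + x[j]) / 2 > 0)


sample = [1.2, -0.7, 3.1, -2.4, 0.4]   # hypothetical observations
print(signed_rank_T(sample), positive_walsh_count(sample))
```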


1.1 Representation in terms of U statistic 


Considering the representation in terms of positive Walsh averages, we have

$$T = \binom{n}{1} U_1 + \binom{n}{2} U_2,$$

where $U_1$ is a U statistic with kernel $I(X_1 > 0)$ and $U_2$ is a U statistic with kernel $I(X_1 + X_2 > 0)$.


2 Consistency of Signed rank Test 


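Consider testing $H_0 : \theta = 0$ against $H_a : \theta > 0$ using $T$. The Signed rank test can be equivalently performed in terms of $T/\binom{n}{2}$. From the convergence of U statistics, we find for any $\theta$,

$$U_1 \xrightarrow{P} P(X_1 > 0)$$

and

$$U_2 \xrightarrow{P} P(X_1 + X_2 > 0).$$

Thus, since $T/\binom{n}{2} = \frac{2}{n-1} U_1 + U_2$,

$$\frac{T}{\binom{n}{2}} \xrightarrow{P} P(X_1 + X_2 > 0).$$

Now define $\mu(F) = P(X_1 + X_2 > 0)$. Then, under $\theta = 0$,

$$\mu(F) = E\, I(X_1 + X_2 > 0) = E\, E\{I(X_1 + X_2 > 0) \mid X_2\} = E(1 - F(-X_2)) = E(F(X_2)) = \frac{1}{2}.$$

Since for any $i$, $X_i \sim F(x - \theta) \Leftrightarrow X_i - \theta \sim F(x)$, the event $X_1 + X_2 > 0$ under $\theta = 0$ is a subset of that under $\theta > 0$ and hence $P(X_1 + X_2 > 0) > \frac{1}{2}$ for $\theta > 0$. Thus $\mu(F) > \frac{1}{2}$ or $= \frac{1}{2}$ according as $\theta > 0$ or $\theta = 0$, and hence the Signed rank test is consistent against the alternative $H_a : \theta > 0$.

Consistency against the other alternatives can also be proved.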


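3 Signed rank test is unbiased


Consider testing $H_0 : \theta = \theta_0$ against $H_a : \theta > \theta_0$ and the representation $T(\theta) = \#\{(X_i, X_j) : \frac{X_i + X_j}{2} - \theta > 0,\ i \le j\}$. Assume $\theta_0 = 0$; then the power function can be expressed as

$$E_\theta \phi = P_\theta(T(0) > T_\alpha) + a\, P_\theta(T(0) = T_\alpha) = a\, P_\theta(T(0) \ge T_\alpha) + (1 - a)\, P_\theta(T(0) > T_\alpha).$$

Since for any $i$, $X_i \sim F(x - \theta) \Leftrightarrow X_i - \theta \sim F(x)$, $T(0)$ tends to be larger under any $\theta > 0$ than under $\theta = 0$. Thus under any $\theta > 0$, $P_\theta(T(0) \ge k) \ge P_{\theta=0}(T(0) \ge k)$ for any $k \ge 0$. Thus $E_\theta \phi \ge P_{\theta=0}(T(0) > T_\alpha) + a\, P_{\theta=0}(T(0) = T_\alpha)$. Since the distribution of $T(\theta)$ under any $\theta$ is the same as that of $T(0)$ under $\theta = 0$, we find that the RHS of the above equals $E_{\theta_0} \phi = \alpha$. Thus we find that $E_\theta \phi \ge \alpha$ for any $\theta > \theta_0$ and hence unbiasedness follows.

Similarly unbiasedness for the other tests can be proved.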


4 Confidence interval for 0 


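We can use the cut off of the Signed rank test to get a confidence interval for $\theta$ with confidence coefficient at least $(1 - \alpha)$. Now $T(\theta)$ can be looked upon as the number of Walsh averages exceeding $\theta$, out of $m = \frac{n(n+1)}{2}$ Walsh averages in all. Then the acceptance region of the non randomized level $\alpha$ test for $H_0 : \theta = 0$ against $H_a : \theta \neq 0$ can be expressed as $c \le T(0) \le m - c$, where $c$ is such that $P_{H_0}(c \le T(0) \le m - c) \ge 1 - \alpha$. Then from the theory of order statistics, we find that

$$T(\theta) \le m - c \Leftrightarrow \theta \ge W_{(c)}$$

and

$$T(\theta) \ge c \Leftrightarrow \theta < W_{(m-c+1)},$$

where $W_{(i)}$ are the ordered Walsh averages. Thus

$$c \le T(\theta) \le m - c \Leftrightarrow W_{(c)} \le \theta < W_{(m-c+1)}.$$

Hence $P_\theta\{[W_{(c)}, W_{(m-c+1)}) \ni \theta\} = P_\theta(c \le T(\theta) \le m - c)$. Since the distribution of $T(\theta)$ under $\theta$ is the same as that of $T(0)$ under $H_0$, we have

$$P_\theta(c \le T(\theta) \le m - c) = P_{H_0}(c \le T(0) \le m - c) \ge 1 - \alpha.$$

Hence $[W_{(c)}, W_{(m-c+1)})$ is a confidence interval for $\theta$ with confidence probability at least $1 - \alpha$.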


4.1 Finding Walsh averages 


Thus the confidence interval depends on the ordered Walsh averages. Now we shall discuss how to obtain such averages in practice. Suppose $n = 4$ and the observations $X_i$, $i = 1,\ldots,4$ are $-2, 5, 8, 13$. Then we have $m = 10$ Walsh averages $\frac{X_i + X_j}{2}$, $i \le j$, as below:

   | X1  | X2  | X3   | X4
X1 | -2  |     |      |
X2 | 1.5 | 5   |      |
X3 | 3   | 6.5 | 8    |
X4 | 5.5 | 9   | 10.5 | 13

Thus we get $W = (-2, 1.5, 5, 3, 6.5, 8, 5.5, 9, 10.5, 13)$ and hence the ordered values $(-2, 1.5, 3, 5, 5.5, 6.5, 8, 9, 10.5, 13)$ can be obtained.
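The construction can be sketched in Python as follows. The helper names are hypothetical, and the rule "take the largest $c$ whose acceptance region still has null probability at least $1-\alpha$" is one reasonable reading of the text rather than a prescribed algorithm.

```python
from itertools import combinations_with_replacement


def walsh_averages(x):
    """Ordered Walsh averages (x_i + x_j)/2 for i <= j."""
    return sorted((a + b) / 2 for a, b in combinations_with_replacement(x, 2))


def null_counts(n):
    """counts[t] = number of subsets of {1, ..., n} with sum t (null law of T)."""
    counts = [1]
    for k in range(1, n + 1):
        new = counts + [0] * k
        for t, c in enumerate(counts):
            new[t + k] += c
        counts = new
    return counts


def signed_rank_ci(x, alpha):
    """Return (c, W_(c), W_(m-c+1)) for the interval [W_(c), W_(m-c+1)), or None."""
    n = len(x)
    w = walsh_averages(x)
    m = len(w)
    counts = null_counts(n)
    best = None
    for c in range(1, m // 2 + 1):
        inside = sum(counts[c:m - c + 1])          # 2^n * P(c <= T <= m - c)
        if inside >= (1 - alpha) * 2 ** n:
            best = c                               # keep the largest valid c
    if best is None:
        return None                                # n too small for this alpha
    return best, w[best - 1], w[m - best]


data = [-2, 5, 8, 13]
print(walsh_averages(data))
print(signed_rank_ci(data, alpha=0.2))
```

With only four observations the attainable confidence levels are coarse, so for small $\alpha$ the function returns `None`.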


5 Asymptotic normality 


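Under the null hypothesis, $T = \sum_{i=1}^{n} i Z_i$ is a sum of independent but non identically distributed random variables. Then under the null hypothesis, as $n \to \infty$,

$$T^* = \frac{T - E(T)}{\sqrt{\mathrm{Var}(T)}} \xrightarrow{D} N(0, 1)$$

provided Liapounov's condition is satisfied. Liapounov's condition will be satisfied if

$$R_n = \frac{\sum_{i=1}^{n} E|iZ_i - E(iZ_i)|^3}{\{\sum_{i=1}^{n} \mathrm{Var}(iZ_i)\}^{3/2}} \to 0$$

as $n \to \infty$. Since $E(Z_i^2) = E(Z_i^3) = \frac{1}{2}$, it is straightforward to show that the above condition is satisfied: the numerator is of order $n^4$ while the denominator is of order $n^{9/2}$.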


5.1 Large sample test 


Since asymptotic normality holds, different tests can be performed in large samples using $T^*$. Consider testing $H_0 : \theta = \theta_0$ against $H_a : \theta > \theta_0$. Then the corresponding large sample test is non-randomized and rejects the null hypothesis if the observed value of $T^*$ exceeds $\tau_\alpha$, the upper $\alpha$ point of the N(0,1) distribution. Similarly, the large sample tests for the other hypotheses can also be constructed.
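A minimal Python sketch of the right-tailed large sample test is given below. It assumes no zeros or ties among the differences $X_i - \theta_0$, and the function name `signed_rank_z_test` and the data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def signed_rank_z_test(x, theta0=0.0, alpha=0.05):
    """Return (T, T*, reject) for H0: theta = theta0 against Ha: theta > theta0."""
    d = [xi - theta0 for xi in x]
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    t = sum(rank for rank, i in enumerate(order, start=1) if d[i] > 0)
    mean = n * (n + 1) / 4                      # E(T) under H0
    var = n * (n + 1) * (2 * n + 1) / 24        # Var(T) under H0
    t_star = (t - mean) / sqrt(var)
    tau = NormalDist().inv_cdf(1 - alpha)       # upper alpha point of N(0, 1)
    return t, t_star, t_star > tau


# Hypothetical sample of 20 observations, all above theta0 = 0.
print(signed_rank_z_test([0.1 * k for k in range(1, 21)]))
```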




6 Application: Test for symmetry 


Symmetry of the underlying distribution can also be tested using the Wilcoxon signed-rank statistic. Suppose $X_i$, $i = 1,2,\ldots,n$ are iid observations from an unknown distribution $F$. Assume that $F$ is a continuous distribution. We wish to test $H_0$: $F$ is symmetric with known median $\theta_0$. That is, under the null hypothesis, $X_i - \theta_0 \overset{D}{=} \theta_0 - X_i$ for each $i$. We suggest using $T$, computed from the $X_i - \theta_0$, and the related critical regions for the situation.

If $H_0$ is accepted, it is concluded that $F$ is symmetric with median $\theta_0$. However, rejection of $H_0$ is difficult to explain. Consider testing $H_0$ against all alternatives and suppose $H_0$ is rejected. Then either $F$ is symmetric with median $\neq \theta_0$, or $F$ is asymmetric with median $\theta_0$, or $F$ is asymmetric with median $\neq \theta_0$. Such a broad conclusion, in general, lacks conclusiveness.


7 Paired Sample Wilcoxon Signed rank Test 


Suppose $(X_i, Y_i)$, $i = 1,2,\ldots,n$ are sample observations from an unknown continuous bivariate distribution $F(x,y)$. If the median of the distribution of the difference $D = X - Y$ is $\theta_D$, the objective is to test $H_0 : \theta_D = 0$ against the usual one sided and two sided alternatives. Naturally, $\theta_D = \delta\ (> 0)$ indicates that the X observations tend to be $\delta$ units larger than the corresponding Y observations. Thus statements about $\theta_D$ give information about the relative locations of the marginal distributions. Then the appropriate procedure is simply the Wilcoxon Signed rank test based on the new set of observations $D_i = X_i - Y_i$, $i = 1,2,\ldots,n$.


8 Signed Rank test & Sign test 


Like the Sign test, the Signed rank test is a nonparametric competitor of Student's t test. So which of the two is better? Of course the test with higher power is preferable. Now we shall give a comparative picture of the power of the two tests. Consider $n=10$ observations from a $N(\theta,1)$ population, and suppose we test $H_0 : \theta = 0$ against $H_a : \theta > 0$ at size $\alpha = .056$. Then the power function of the Signed Rank test is $P_\theta(T > T_\alpha) + a P_\theta(T = T_\alpha)$. If $\pi = P_\theta(X_1 > 0) = \Phi(\theta)$, then the power function of the Sign test is $P_\theta(S > S_\alpha) + a P_\theta(S = S_\alpha)$, where $S \sim$ Binomial$(10, \pi)$. For $n = 10$ and $\alpha = .056$ we find $T_\alpha = 43$ and $a = .25$. Again for the Sign test we find $S_\alpha = 7$ and $a = .0112$. Then we draw the two power curves on the same graph for varying $\theta$. The power for the Signed Rank test is obtained through simulation.


Figure 1: Nature of power for varying $\theta$ (indicated by $\mu$)
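The cut offs and the two power functions can be reproduced approximately with the Python sketch below. The helper names are hypothetical; the Sign test power is exact, while the Signed rank power is a Monte Carlo estimate whose number of replications and seed are arbitrary choices made here.

```python
import random
from math import comb
from statistics import NormalDist


def signed_rank_pmf(n):
    """Exact null pmf of T as floats."""
    counts = [1]
    for k in range(1, n + 1):
        new = counts + [0] * k
        for t, c in enumerate(counts):
            new[t + k] += c
        counts = new
    return [c / 2 ** n for c in counts]


def randomized_cutoff(pmf, alpha):
    """Return (c, a) such that P(T > c) + a * P(T = c) = alpha under the null."""
    tail = 0.0
    for c in range(len(pmf) - 1, -1, -1):
        if tail + pmf[c] > alpha:
            return c, (alpha - tail) / pmf[c]
        tail += pmf[c]
    raise ValueError("alpha must be smaller than 1")


def sign_test_power(theta, n=10, alpha=0.056):
    """Exact power of the randomized Sign test when X ~ N(theta, 1)."""
    null = [comb(n, s) / 2 ** n for s in range(n + 1)]
    c, a = randomized_cutoff(null, alpha)
    p = NormalDist().cdf(theta)
    alt = [comb(n, s) * p ** s * (1 - p) ** (n - s) for s in range(n + 1)]
    return sum(alt[c + 1:]) + a * alt[c]


def signed_rank_power(theta, n=10, alpha=0.056, reps=20000, seed=1):
    """Simulated power of the randomized Signed rank test when X ~ N(theta, 1)."""
    rng = random.Random(seed)
    c, a = randomized_cutoff(signed_rank_pmf(n), alpha)
    total = 0.0
    for _ in range(reps):
        x = [rng.gauss(theta, 1) for _ in range(n)]
        order = sorted(range(n), key=lambda i: abs(x[i]))
        t = sum(r for r, i in enumerate(order, start=1) if x[i] > 0)
        total += 1.0 if t > c else (a if t == c else 0.0)
    return total / reps


print(randomized_cutoff(signed_rank_pmf(10), 0.056))
print(sign_test_power(0.5), signed_rank_power(0.5))
```

The first printed pair gives the cut off and randomization constant for the Signed rank test, which can be compared with the values $T_\alpha = 43$ and $a = .25$ quoted in the text.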


8.1 Observation & Recommendation 


The power curve for the Sign test always lies below that of the Signed rank test. The reason is that, unlike the Sign test, the Signed rank test incorporates symmetry of the underlying distribution through the test statistic. Thus the Signed rank test is more powerful. Hence, if the underlying distribution is continuous and symmetric (which can be checked through the histogram), the Signed rank test is the most appropriate option. It is always a safe option when there is any doubt about the normality of the underlying symmetric population.


Nonparametric Inference: Module 12: 
What we provide in this module 


• Goodness-of-fit problems
• Pearsonian chi square tests
• Kolmogorov-Smirnov tests
• Properties : Consistency & Distribution free
• Some other goodness-of-fit tests




1 What is goodness of fit? 


Often the underlying distribution of the data is not known. The traditional assumption of normality is hard to justify in such cases, and therefore the experimenter wishes to know whether the distribution is of a given form. Thus the objective is to know which distribution "fits" the observed data well. Tests for determining whether a given distribution is a good fit to the observed data are known as goodness-of-fit tests.


1.1 A motivating example 


For better understanding, let us consider an example, where a random sample of size 9 is taken from a continuous distribution on [0,1] as

.01, .11, .23, .34, .55, .65, .77, .78, .89

Naturally, the sample size, that is, 9, is too small to assume normality. Again, the observations are taken from a distribution with bounded support. Thus normality would only be a forced assumption. Suppose from a previous idea, it is known that the distribution can be assumed to be a Beta(.5, .5) distribution. Then the usual methods can not be applied to test the hypothesis that the observations are coming from a Beta(.5, .5) distribution.


1.2 Another example 


Suppose the number of misprints per page of a book of 350 pages is represented in the form of the following grouped distribution:

No. of misprints/page | 0 | 1 | 2 | 3 | 4
Number of pages | 250 | 84 | 16 | 5 | 0

Here the number of observations is large enough to assume normality. But the variable is discrete with a small number of categories. It is well known that the number of misprints/page has, in general, a Poisson distribution. But the mean, i.e. the only parameter, is not specified in advance. Then with the usual methods, testing the hypothesis that the observations are coming from a Poisson distribution with unknown parameter is not possible.


1.3 A further example


Now consider data on the heights of 122 students in a certain class in the form of the grouped distribution given below:

Height(cm.) | 154-159 | 159-164 | 164-169 | 169-174 | 174-179
# students | 4 | 32 | 54 | 27 | 5

For the above data, the number of observations is large enough to assume normality. But the data is presented in the form of a frequency distribution. Moreover, the distribution of height is known to be normal. But even under the assumption of normality, the parameters are not given. Thus we need to develop further methods to check that the observations are coming from a normal distribution.


1.4 Related observations 


From these examples, it is observed that: 


• We often have data from an unknown continuous distribution with support other than the whole real line.
• We can have data from a known discrete distribution but with unspecified parameter(s).
• We can also have data from a known continuous distribution with unspecified parameter(s).

In these examples the data is given either in the form of a frequency distribution or in full, and the objective was to test whether a given distribution fits the data well.


2 The hypothesis for goodness of fit 


Suppose $X_i$, $i = 1,2,\ldots,n$ are iid observations from a distribution with DF $F(x)$. Then in the goodness-of-fit problems the objective is to test the null hypothesis

$$H_0 : F(x) = F_0(x)\ \forall x$$

against

$$H_a : F(x) \neq F_0(x) \text{ for at least one } x.$$

Now, we have two possibilities: $F_0(x)$ is completely known or $F_0(x)$ is not completely specified, i.e. some parameters remain unknown. For completely known $F_0(x)$, the null hypothesis is simple but in the other case the null hypothesis is composite. However, in each case the alternative is composite. Now we develop a number of tests for such hypotheses.


3 Pearsonian Chi-square test 


Suppose we have data in the form of a frequency distribution. For discrete distributions, the categories are natural (i.e. how many observations are 0 or 1 or 2 etc.). But for continuous data, the experimenter prepares his/her own frequency distribution. Obviously conversion of data into the form of a grouped frequency distribution incurs a loss of information. Naturally, the lower the number of categories, the higher is the loss. Then, it is reasonable to classify the data set into a higher number of categories.

Suppose $n$ observations are available from the distribution of $X$. Let the range of $X$ be divided into mutually exclusive and exhaustive sets $A_i$, $i = 1,2,\ldots,k$ with $f_i$ as the number of observations classified in $A_i$. Then $\sum_{i=1}^{k} f_i = n$. Now define $p_i = P_{H_0}(X \in A_i)$, $i = 1,2,\ldots,k$ as the probability that an $X$ observation falls in $A_i$ under the null hypothesis.


3.1 $F_0(x)$ is completely specified


For completely specified $F_0(x)$, the $p_i$ are completely known under the null hypothesis with $\sum_{i=1}^{k} p_i = 1$. Now the joint distribution of $(f_1, f_2, \ldots, f_{k-1})$ is $(k-1)$-variate multinomial with parameters $(n, p_1, p_2, \ldots, p_{k-1})$. Since $E(f_i) = np_i$, $i = 1,2,\ldots,k$, under the null hypothesis the $f_i$ are expected to be close to $np_i$ for each $i$. Then departure from the null hypothesis can be measured by $f_i - np_i$. Pearson's Chi-square test uses the statistic $U_k = \sum_{i=1}^{k} \frac{(f_i - np_i)^2}{np_i}$ to measure the overall discrepancy.

It can be shown that under the null hypothesis, $U_k \xrightarrow{D} \chi^2_{k-1}$ as $n \to \infty$. Thus under the null hypothesis, the asymptotic distribution is independent of any $F$. Therefore, the test provided by $U_k$ is asymptotically nonparametric. Now the higher the discrepancy, the larger is the value of $U_k$. Therefore, a higher value of the statistic indicates departure from the null hypothesis. Thus, the large sample test rejects the null hypothesis if $U_k > \chi^2_{k-1,\alpha}$ at the $100\alpha\%$ level of significance.


3.2 Fo(x) is not completely specified 


Suppose $F_0$ is specified except for some parameters. Let the unknown parameters of $F_0$ be $(\theta_1, \theta_2, \ldots, \theta_r)$. Then $p_i = P_{H_0}(X \in A_i)$, $i = 1,2,\ldots,k$ are specified except for some parameters. In fact, $p_i$ depends on the unknown $(\theta_1, \theta_2, \ldots, \theta_r)$. Thus, in such a case $p_i = p_i(\theta_1, \theta_2, \ldots, \theta_r)$, a function of the unknown parameters. Then we need to estimate $p_i$ from the data to measure discrepancies. Suppose $\hat\theta_i$, $i = 1,2,\ldots,r$ are the ML estimates based on the data. Then $\hat p_i = p_i(\hat\theta_1, \hat\theta_2, \ldots, \hat\theta_r)$, $i = 1,2,\ldots,k$ are the ML estimates of $p_i$, $i = 1,2,\ldots,k$. Pearson's Chi-square statistic in this situation is taken as $V_k = \sum_{i=1}^{k} \frac{(f_i - n\hat p_i)^2}{n\hat p_i}$. As earlier, a higher value of $V_k$ indicates possible departure from the null hypothesis.

Now, it can be shown, as earlier, that under the null hypothesis, $V_k \xrightarrow{D} \chi^2_{k-r-1}$ as $n \to \infty$. Thus under the null hypothesis, the asymptotic distribution is independent of any $F$. Therefore, the test provided by $V_k$ is asymptotically nonparametric. Since a higher value of $V_k$ indicates departure from the null hypothesis, a right tailed test based on $V_k$ is appropriate. Thus, the large sample test rejects the null hypothesis if $V_k > \chi^2_{k-r-1,\alpha}$ at the $100\alpha\%$ level of significance.


3.3 Chi-square test: Some facts 


Now we shall discuss some important facts related to the above test. First of all, note that the choice of $k$ is subjective; too small a $k$ fails to capture the features of the underlying distribution. Thus the experimenter needs to classify the data into as many categories as possible to gather more information about the underlying distribution. Often the data is available in the form of a grouped distribution. If $np_i$ (or its estimate) is less than 5 for some $i$, the corresponding class is pooled with one or more neighbouring classes so that the expected frequency for the combined class is at least 5. In such a case, the degrees of freedom become the number of classes after combining, minus 1, minus the number of parameters estimated.
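The steps above can be sketched in Python for the grouped height data of Section 1.3. Several choices below are assumptions made only for illustration: the mean and standard deviation are estimated from class midpoints (a simple stand-in for the ML estimates from grouped data), the first and last classes are treated as open-ended so that the classes are exhaustive, and small classes are pooled from left to right. The function name `pearson_normal_fit` is hypothetical.

```python
from statistics import NormalDist

edges = [154, 159, 164, 169, 174, 179]   # class boundaries (cm)
freq = [4, 32, 54, 27, 5]                # observed class frequencies


def pearson_normal_fit(edges, freq, min_expected=5.0):
    """Return (V_k, degrees of freedom, expected frequencies after pooling)."""
    n = sum(freq)
    mids = [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]
    mean = sum(f * m for f, m in zip(freq, mids)) / n
    var = sum(f * (m - mean) ** 2 for f, m in zip(freq, mids)) / n
    dist = NormalDist(mean, var ** 0.5)
    # Open-ended first and last classes: (-inf, 159], ..., (174, inf).
    cdf = [0.0] + [dist.cdf(c) for c in edges[1:-1]] + [1.0]
    expected = [n * (hi - lo) for lo, hi in zip(cdf, cdf[1:])]
    # Pool neighbouring classes until each expected frequency is >= min_expected.
    obs, exp = [], []
    o_acc = e_acc = 0.0
    for o, e in zip(freq, expected):
        o_acc += o
        e_acc += e
        if e_acc >= min_expected:
            obs.append(o_acc)
            exp.append(e_acc)
            o_acc = e_acc = 0.0
    if e_acc > 0:                         # leftover small class joins the last one
        obs[-1] += o_acc
        exp[-1] += e_acc
    stat = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
    df = len(obs) - 1 - 2                 # classes - 1 - (two estimated parameters)
    return stat, df, exp


stat, df, exp = pearson_normal_fit(edges, freq)
print(stat, df, exp)
```

The statistic is then compared with the upper $\alpha$ point of the chi-square distribution with `df` degrees of freedom, taken from tables.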


4 Kolmogorov-Smirnov tests 


Next we discuss another test, which needs the full data, i.e. the data is not available in the form of a grouped frequency distribution. Assume that $F_0$ is continuous and completely specified. Since we need to measure the discrepancy between $F(x)$, the actual distribution, and $F_0(x)$, a postulated known distribution, we use an estimate of $F(x)$. From the theory of U statistics, the empirical DF $F_n(x)$ is the U statistic for the estimation of $F(x)$. Then it is known that $E(F_n(x)) = F(x)$ and for each fixed $x$,

$$\sqrt{n}\,(F_n(x) - F(x)) \xrightarrow{D} N(0, F(x)(1 - F(x)))$$

as $n \to \infty$.


4.1 Kolmogorov-Smirnov distance metric 


Now to develop a measure of discrepancy, we consider the plots of the ecdf based on $n$ iid observations from $F(x)$ and the DF $F_0(x)$. Naturally the ecdf is a step function whereas the DF is continuous. For demonstration, we plot the ecdf $F_n(x)$ based on 10 observations from N(0, 1) and take $F_0(x)$ as N(-.5,1), N(0,1) and N(.5,1), respectively. All these are given in the following plots.

[Plots of $F_n(x)$ and $F_0(x)$ against $x$]


Consider the figures and look at the vertical distance between $F_n(x)$ and $F_0(x)$ for fixed $x$. Observe that the distance is a measure of departure from $F_0(x)$. However, for some $x$ the distance is lower and for some $x$ the distance is higher. Therefore, we consider the maximum of such distances to measure the distance from $F_0(x)$. Naturally higher values of the maximum of such distances would indicate departure from $F_0(x)$. This suggests using the distance metric $\sup_x (F_n(x) - F_0(x))$ or $\sup_x (F_0(x) - F_n(x))$, which are known as Kolmogorov-Smirnov (KS) statistics.


4.1.1 Kolmogorov-Smirnov test: Critical region 


Suppose the alternative is $H_a : F(x) > F_0(x)$ for some $x$. Then our statistic is $D_n^+ = \sup_x (F_n(x) - F_0(x))$ and we reject $H_0$ if $D_n^+$ is too large. For the alternative $H_a : F(x) < F_0(x)$ for some $x$, we use $D_n^- = \sup_x (F_0(x) - F_n(x))$ and we reject $H_0$ if $D_n^-$ is too large. However, for the alternative $H_a : F(x) \neq F_0(x)$ for some $x$, we use $D_n = \max(D_n^+, D_n^-) = \sup_x |F_n(x) - F_0(x)|$ and we reject $H_0$ if $D_n$ is too large. The exact values of the cut off can be obtained from the tables of Owen (1962) for specific choices of $n$ and level $\alpha$.
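For a continuous $F_0$, the suprema are attained at the ordered observations, so the three statistics can be computed as in the Python sketch below. It is applied to the sample of Section 1.1 with $F_0$ taken as Beta(.5, .5), whose DF is $\frac{2}{\pi}\arcsin\sqrt{x}$; reading the first observation as .01 is an assumption, and the function names are illustrative.

```python
from math import asin, pi, sqrt


def ks_statistics(sample, cdf0):
    """Return (D_n^+, D_n^-, D_n) for a continuous null DF cdf0."""
    u = sorted(cdf0(x) for x in sample)        # F_0 at the ordered observations
    n = len(u)
    d_plus = max(i / n - ui for i, ui in enumerate(u, start=1))
    d_minus = max(ui - (i - 1) / n for i, ui in enumerate(u, start=1))
    return d_plus, d_minus, max(d_plus, d_minus)


def arcsine_cdf(x):
    """DF of the Beta(.5, .5) (arcsine) distribution on [0, 1]."""
    return 2 / pi * asin(sqrt(x))


data = [.01, .11, .23, .34, .55, .65, .77, .78, .89]
print(ks_statistics(data, arcsine_cdf))
```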


4.1.2 Kolmogorov-Smirnov Large sample Test 


Now it can be shown that $4nD_n^{+2} \xrightarrow{D} \chi^2_2$ as $n \to \infty$. Thus a large sample test for $H_0 : F(x) = F_0(x)\ \forall x$ against $H_a : F(x) > F_0(x)$ for some $x$ rejects the null hypothesis if $4nD_n^{+2} > \chi^2_{2,\alpha}$. Since $D_n^+ \overset{D}{=} D_n^-$ due to symmetry, a similar test against the alternative $H_a : F(x) < F_0(x)$ for some $x$ can be obtained. Again, it can be shown that $\lim_{n \to \infty} P(\sqrt{n} D_n \le z) = G(z) = 1 - 2\sum_{j=1}^{\infty} (-1)^{j-1} e^{-2j^2z^2}$. Thus a large sample test for $H_0 : F(x) = F_0(x)$ for all $x$ against $H_a : F(x) \neq F_0(x)$ for some $x$ rejects the null hypothesis if $\sqrt{n} D_n > d_\alpha$, where $d_\alpha$ is a root of $G(d_\alpha) = 1 - \alpha$.
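The two limiting results translate into approximate p values as in the following Python sketch. The one sided value uses the fact that the upper tail of a chi-square distribution with 2 degrees of freedom at $x$ is $e^{-x/2}$; the series for $G$ is truncated at a fixed number of terms, which is an implementation choice made here and is adequate only for moderate $z$.

```python
from math import exp, sqrt


def ks_one_sided_pvalue(d_plus, n):
    """Approximate P(4 n D+^2 > observed) using the chi-square(2) upper tail."""
    return exp(-2 * n * d_plus ** 2)


def kolmogorov_cdf(z, terms=100):
    """G(z) = 1 - 2 * sum_{j>=1} (-1)^(j-1) exp(-2 j^2 z^2), truncated series."""
    if z <= 0:
        return 0.0
    series = sum((-1) ** (j - 1) * exp(-2 * j * j * z * z) for j in range(1, terms + 1))
    return 1 - 2 * series


def ks_two_sided_pvalue(d, n):
    """Approximate P(sqrt(n) D_n > observed) = 1 - G(sqrt(n) d)."""
    return 1 - kolmogorov_cdf(sqrt(n) * d)


print(kolmogorov_cdf(1.358))   # close to 0.95, so d_alpha is about 1.358 for alpha = .05
```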


4.1.3 Kolmogorov-Smirnov test is nonparametric 


Now it is natural to inquire whether the test based on the Kolmogorov-Smirnov statistic is distribution free. For this assume $F(x) = F_0(x)\ \forall x$. Note that the ecdf can be represented as $F_n(x) = \frac{i}{n}$ if $X_{(i)} \le x < X_{(i+1)}$ for $i = 0,1,2,\ldots,n$, with $X_{(0)} = -\infty$ and $X_{(n+1)} = \infty$. Then $D_n^+$ can be expressed as

$$D_n^+ = \sup_x (F_n(x) - F_0(x)) = \max_{0 \le i \le n}\ \sup_{X_{(i)} \le x < X_{(i+1)}} (F_n(x) - F_0(x)) = \max_{0 \le i \le n} \Big(\frac{i}{n} - F_0(X_{(i)})\Big) \overset{D}{=} \max_{0 \le i \le n} \Big(\frac{i}{n} - U_{(i)}\Big),$$

where $U_{(i)}$, $i = 1,2,\ldots,n$ are the order statistics for a random sample of size $n$ from a R(0,1) distribution and $U_{(0)} = 0$. Thus $D_n^+$ is a function of $U_{(i)}$, $i = 0,1,2,\ldots,n$. Since the joint distribution of $U_{(i)}$, $i = 1,2,\ldots,n$ is independent of any $F$ under the null hypothesis, the distribution of $D_n^+$ does not depend on any $F$. Thus the test given by $D_n^+$ is exactly nonparametric. Since $D_n^- \overset{D}{=} D_n^+$, the test given by $D_n^-$ is also nonparametric. Again, $D_n = \max(D_n^+, D_n^-)$, and hence the corresponding test is also nonparametric.


4.1.4 Consistency of Kolmogorov-Smirnov test 


Tests based on the KS statistic are consistent. For a proof consider the alternative $H_a : F(x) > F_0(x)$ for some $x$. Then an asymptotically size $\alpha$ test rejects the null hypothesis if $4nD_n^{+2} > \chi^2_{2,\alpha}$, or equivalently if $D_n^+ > \sqrt{\chi^2_{2,\alpha}/(4n)}$. Now consider the Glivenko-Cantelli lemma, that is,

$$\sup_x |F_n(x) - F(x)| \xrightarrow{a.s.} 0.$$

Then using the Glivenko-Cantelli lemma, it can be shown that

$$D_n^+ \xrightarrow{P} \sup_x (F(x) - F_0(x)).$$

Then

$$\sup_x (F(x) - F_0(x)) = 0 \text{ if } F = F_0, \qquad > 0 \text{ if } F > F_0.$$

Thus from the theory of consistency developed earlier, the one sided test is consistent.

Similarly the consistency of the other tests can also be established.




5 A few other goodness-of-fit statistics


In this module we have discussed only two distance metrics. But there are a number of such metrics which can be used as goodness of fit statistics. Some examples include

• Cramer-Von Mises (CvM) statistic defined by

$$C_n = \int (F_n(x) - F_0(x))^2\, dF_0(x)$$

• Anderson-Darling (AD) statistic defined by

$$A_n = \int \frac{(F_n(x) - F_0(x))^2}{F_0(x)(1 - F_0(x))}\, dF_0(x)$$

For either statistic, we reject the null hypothesis if the observed value of the statistic is large. Tests provided by these statistics are also nonparametric. The proof follows from the representations:

$$nC_n = \frac{1}{12n} + \sum_{i=1}^{n} \Big(U_{(i)} - \frac{2i-1}{2n}\Big)^2$$

and

$$nA_n = -n - \frac{1}{n} \sum_{i=1}^{n} (2i-1)\big[\log U_{(i)} + \log(1 - U_{(n+1-i)})\big],$$

where $U_{(i)}$, $i = 1,2,\ldots,n$ are, as before, the order statistics for a random sample of size $n$ from a R(0,1) distribution. Thus $C_n$ and $A_n$ are distribution free and hence the tests provided by $C_n$ and $A_n$ are nonparametric.
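The two representations give a direct way to compute the statistics. In the Python sketch below, $U_{(i)}$ is obtained as $F_0(X_{(i)})$; the function name `cvm_ad` is hypothetical, and the code assumes $0 < F_0(X_i) < 1$ for every observation so that the logarithms exist.

```python
from math import log


def cvm_ad(sample, cdf0):
    """Return (n*C_n, n*A_n) from the order-statistic representations."""
    u = sorted(cdf0(x) for x in sample)        # U_(1) <= ... <= U_(n)
    n = len(u)
    n_cn = 1 / (12 * n) + sum(
        (ui - (2 * i - 1) / (2 * n)) ** 2 for i, ui in enumerate(u, start=1)
    )
    n_an = -n - sum(
        (2 * i - 1) * (log(u[i - 1]) + log(1 - u[n - i])) for i in range(1, n + 1)
    ) / n
    return n_cn, n_an


# Hypothetical sample tested against the R(0,1) distribution.
print(cvm_ad([0.12, 0.35, 0.48, 0.71, 0.93], lambda x: x))
```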


6 An application: Test for normality 


The Kolmogorov-Smirnov test can also be applied to test the normality of the given data in small samples. For normality, we take $F_0(x)$ as the DF of a standard normal variable. However, for the purpose of comparison, we consider the standardised version of the data set (say $X^*$). Then application of the Kolmogorov-Smirnov test on $X^*$ can reveal the underlying normality. Therefore, the procedure can be applied to identify non normal data even in small samples.




What we provide in this module


• Motivation for development
• Definition of Mann-Whitney U
• U as a measure of degree of separation of the two populations
• Null distribution of U
• Summary measures of U


1 Motivation


We start with an example with data from two independent populations. Consider the lifetimes (in 1000 hours) of bulbs corresponding to two different brands (say, Old and New):

Old | 5.4 | 5.22 | 4.5 | 4.87 | 5.5 | 5.22 | 5.55 | 5.23 | 4.5 | 5.0
New | 7.99 | 7.41 | 8.35 | 8.35 | 7.82 | 7.48 | 7.89 | 7.69

Then it is natural to ask which brand is performing better. The usual practice is to use the two sample t test. But we need normality for such a test and hence we perform some exploratory data analysis. We provide normal Q-Q plots and a boxplot for the two sets of data.


Figure 1: Normal Q-Q plots of the standardised data (sample quantiles against theoretical quantiles): (a) Old Brand, (b) New Brand


The Q-Q plot for the Old brand data shows non normality whereas that for the New brand data reveals that the underlying distribution might be normal. Thus the t test is not appropriate for this data. Again the box plot shows a marked difference between the locations.

Figure 2: Box plot for the data

Since deciding on an appropriate distribution is subjective, applying a parametric test is not reasonable and hence we need alternative procedures to judge the hypothesis.


2 The hypothesis 


Suppose $X_i$, $i = 1,2,\ldots,n$ and $Y_j$, $j = 1,2,\ldots,m$ are independent observations from distributions $F$ and $G$ respectively. Then the null hypothesis can be expressed as

$$H_0 : F(x) = G(x) \text{ for all } x.$$

The most general two-sided alternative is $H_a : F(x) \neq G(x)$ for some $x$, and the one-sided alternatives are $H_a : F(x) \ge G(x)$ for all $x$ with strict inequality for some $x$, or $H_a : F(x) \le G(x)$ for all $x$ with strict inequality for some $x$. Note that $F(x) \ge G(x)$ (or $F(x) \le G(x)$) for all $x$ with strict inequality for some $x$ implies that $Y$ is stochastically larger (or smaller) than $X$.


2.1 How to set the alternative? 


Consider the data example. The objective is to know which brand gives the longer lifetime, that is, for which brand the lifetime is expected to be higher. Suppose $X$ ($Y$) is the lifetime variable for old (new) brand bulbs. Then our interest is to know whether new brand bulbs are better, that is, whether $Y$ is stochastically larger than $X$. Thus for the given data the appropriate alternative should be $H_a : F(x) \ge G(x)$ for all $x$ with strict inequality for some $x$. Depending on the need of the situation, the other alternatives are set.


2.2 Hypotheses under the location model 


Suppose the alternative of interest is simply a difference in location (e.g. a difference of average lifetime); then we assume $G(x) = F(x - \theta)\ \forall x$. That is, the two populations differ only in location. Then

$$F(x) = G(x)\ \forall x \Leftrightarrow \theta = 0$$

and

$$F(x) \ge (\le)\ G(x)\ \forall x \Leftrightarrow \theta \ge (\le)\ 0.$$

Thus the testing problem reduces to $H_0 : \theta = 0$ against all alternatives. Note that under a location model

$\theta > (<)\ 0 \Rightarrow$ the second population is shifted to the right (or left) of the first population.

Now for a clear view of the location model, assume that $f(x) = \frac{dF(x)}{dx}$ exists for all $x$. Then the nature of $f(x - \theta)$ for different $\theta$, that is, under different hypotheses, can be graphically traced as below.


Figure 3: $f(x - \theta)$ for different $\theta$


3 Mann-Whitney Statistic & properties


Suppose the X observations and Y observations are mixed together and ordered according to their magnitudes. The Mann-Whitney statistic is based on the position of the Y observations in the combined sample. If $\theta > 0$ (or $< 0$), then most of the Y observations are likely to be greater (or lower) than most of the X observations in the combined sample. Thus the number of times an X observation precedes a Y observation could be large (or small) in such a case. This suggests using the statistic $U = \sum_{i=1}^{n} \sum_{j=1}^{m} I(X_i < Y_j)$ for our purpose. $U$ was suggested by Mann and Whitney and is often called the Mann-Whitney U.
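For the bulb data of the motivating example, $U$ can be computed by a direct double sum, as in the Python sketch below. The two lists are the lifetimes as read from the data table (which brand each value belongs to is an assumption based on the table layout), and the function name is illustrative.

```python
old = [5.4, 5.22, 4.5, 4.87, 5.5, 5.22, 5.55, 5.23, 4.5, 5.0]   # X observations
new = [7.99, 7.41, 8.35, 8.35, 7.82, 7.48, 7.89, 7.69]          # Y observations


def mann_whitney_u(x, y):
    """U = number of pairs (i, j) with x_i < y_j."""
    return sum(1 for xi in x for yj in y if xi < yj)


print(mann_whitney_u(old, new))
```

Every New lifetime exceeds every Old lifetime in these data, so $U$ attains its maximum value $mn$.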


3.1 U is distribution free 


Let $(X_1, \ldots, X_n, Y_1, \ldots, Y_m)$ be the combined sample, where $N = m + n$. Now assume $F = G$, and consider the representation

$$U = \sum_{i=1}^{n} \sum_{j=1}^{m} I(F(X_i) < F(Y_j)).$$

Under $F = G$, $(F(X_1), \ldots, F(X_n), F(Y_1), \ldots, F(Y_m))$ can be treated as iid observations from a R(0,1) population. Then under $F = G$, $U$ is a function of iid R(0,1) variables. Thus the distribution is independent of any $F$ under $F = G$ and depends only on $m$ and $n$. Hence $U$ becomes a distribution free statistic and therefore, tests based on $U$ are exactly nonparametric.


4 U as a measure of degree of separation 


First of all, we need to know the range of $U$. Consider two extreme cases, namely, all the X's are larger than the Y's and all the Y's are larger than the X's. In the former situation $U$ is minimum, i.e. 0, and in the latter case it is maximum, i.e. $mn$. For other configurations, $0 < U < mn$ and hence we get $0 \le U \le mn$. $U$ actually measures the degree of separation of the populations. Naturally $U$ takes the extreme values if the two populations are completely separated. With well mixed observations, $U$ takes the intermediate values. From Figure 4, it is easy to observe that for completely separated observations $U$ is either highest or lowest, and $U$ takes intermediate values depending on how the observations from the two populations are mixed.


4.1 Null Distribution of U 


Define $p_{n,m}(u) = P_{H_0}(U = u)$, $0 \le u \le mn$. Now arrange the $n$ X observations and $m$ Y observations in increasing order of magnitude. Then under $F = G$, each of the $\binom{m+n}{n}$ arrangements of $n$ X's and $m$ Y's is equally likely. If $N_{n,m}(u)$ is the number of arrangements such that the number of X's preceding Y's is $u$, then $p_{n,m}(u) = \frac{N_{n,m}(u)}{\binom{m+n}{n}}$. We shall derive a recursion relation for $N_{n,m}(u)$.

Now in the arranged sequence, the last value is either an X observation or a Y observation. If the largest value is an X observation, it does not alter the value of $U$. Then the number of arrangements with the remaining $(n-1)$ X values and $m$ Y values, with an X value as the last element, so that $U = u$ is $N_{n-1,m}(u)$. However, if the largest value is a Y observation, it is higher than $n$ X's. Then the $n$ X values and the remaining $m-1$ Y values will give $U = u - n$. Then the number of arrangements with $n$ X values and $m-1$ Y values, with a Y value as the last element, so that $U = u - n$ is $N_{n,m-1}(u - n)$. Thus we get

$$N_{n,m}(u) = N_{n-1,m}(u) + N_{n,m-1}(u - n).$$

Since $p_{n,m}(u) = \frac{N_{n,m}(u)}{\binom{m+n}{n}}$, we get the following recursion relation:

$$p_{n,m}(u) = \frac{n}{m+n}\, p_{n-1,m}(u) + \frac{m}{m+n}\, p_{n,m-1}(u - n).$$

We note that $p_{n,m}(u) = 0$ if $u < 0$ and for $n \ge 1$, we have

$$p_{n,0}(u) = 1 \text{ if } u = 0, \qquad = 0 \text{ if } u > 0.$$

Also for $m \ge 1$, we have

$$p_{0,m}(u) = 1 \text{ if } u = 0, \qquad = 0 \text{ if } u > 0.$$


4.1.1 Finding Exact Distribution 


For small $m$ and $n$ one can easily compute the distribution of $U$. We shall give a few examples.

Suppose $m = n = 1$; then $U = I(X_1 < Y_1) \sim$ Bernoulli(.5).

Next suppose $n=2$ and $m=1$; then $U = 0,1,2$. Thus

$N_{2,1}(0) = N_{1,1}(0) + N_{2,0}(-2) = N_{0,1}(0) + N_{1,0}(-1) + 0 = 1$,

$N_{2,1}(1) = N_{1,1}(1) + N_{2,0}(-1) = N_{0,1}(1) + N_{1,0}(0) = 0 + 1 = 1$.

Then $N_{2,1}(2) = 3 - 2 = 1$ and hence $P(U = u) = \frac{1}{3}\ \forall u$.

Next suppose $n=2$ and $m=2$; then $U = 0,1,\ldots,4$. Thus

$N_{2,2}(0) = N_{1,2}(0) + N_{2,1}(-2) = N_{0,2}(0) + N_{1,1}(-1) = 1$,

$N_{2,2}(1) = N_{1,2}(1) + N_{2,1}(-1) = N_{0,2}(1) + N_{1,1}(0) = 0 + N_{0,1}(0) + N_{1,0}(-1) = 1$,

and

$N_{2,2}(2) = N_{1,2}(2) + N_{2,1}(0) = N_{0,2}(2) + N_{1,1}(1) + N_{1,1}(0) + N_{2,0}(-2) = 0 + N_{0,1}(1) + N_{1,0}(0) + 1 = 2$.

In a similar way $N_{2,2}(3) = N_{2,2}(4) = 1$ and hence $P(U = u)$ can be obtained.
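The recursion for $N_{n,m}(u)$ can be coded directly. The Python sketch below uses memoisation and the boundary values stated above; the function names `N` and `u_pmf` are illustrative.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def N(n, m, u):
    """Number of arrangements of n X's and m Y's with U = u."""
    if u < 0 or u > n * m:
        return 0
    if n == 0 or m == 0:
        return 1  # here u == 0, the only possible value
    # Largest value is an X (U unchanged) or a Y (it exceeds all n X's).
    return N(n - 1, m, u) + N(n, m - 1, u - n)


def u_pmf(n, m):
    """Exact null distribution p_{n,m}(u), u = 0, ..., mn."""
    total = comb(m + n, n)
    return [Fraction(N(n, m, u), total) for u in range(n * m + 1)]


print([N(2, 1, u) for u in range(3)])
print([N(2, 2, u) for u in range(5)])
print(u_pmf(2, 2))
```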




4.2 U has a symmetric distribution 


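Note that under F = G, $X_i - Y_j \stackrel{d}{=} Y_j - X_i$ for every (i,j). Then

$$\sum_{i=1}^{n}\sum_{j=1}^{m} I(X_i < Y_j) \stackrel{d}{=} \sum_{i=1}^{n}\sum_{j=1}^{m} I(X_i > Y_j).$$

Thus under F = G, $U \stackrel{d}{=} mn - U$. Therefore, under F = G,

$$U - \frac{mn}{2} \stackrel{d}{=} \frac{mn}{2} - U.$$

Hence the null distribution of U is symmetric about $\frac{mn}{2}$.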


4.3 U in terms of placements 


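With n observations from F(x) and m observations from G(x), define $P_{(j)} = nF_n(Y_{(j)})$,
where $F_n$ is the empirical DF of the X observations. $P_{(j)}$ is called the placement of
$Y_{(j)}$ among the X observations. Then

$$U = \sum_{i=1}^{n}\sum_{j=1}^{m} I(X_{(i)} \le Y_{(j)})
    = \sum_{j=1}^{m}\sum_{i=1}^{n} I(X_{(i)} \le Y_{(j)})
    = \sum_{j=1}^{m} \#\{i : X_{(i)} \le Y_{(j)}\}
    = \sum_{j=1}^{m} nF_n(Y_{(j)}) = \sum_{j=1}^{m} P_{(j)}.$$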


Thus U can be represented in terms of the placements. 


4.4 Summary measures of U 


Under the null hypothesis, U has a symmetric distribution and hence $E_{H_0}(U) = \frac{mn}{2}$.
Note that $\frac{U}{mn}$ can be looked upon as a U statistic for $\theta = P(Y_1 > X_1)$. Then it
is already proved that $\sigma_{10}^2 = E(1 - G(X_1))^2 - \theta^2$,
$\sigma_{01}^2 = E\{F(Y_1)\}^2 - \theta^2$ and $\sigma_{11}^2 = \theta(1 - \theta)$.
Then under the null hypothesis $\theta = \frac{1}{2}$, $\sigma_{10}^2 = \frac{1}{12} = \sigma_{01}^2$
and $\sigma_{11}^2 = \frac{1}{4}$. Using these in the expression of the exact variance of a
U statistic, we get that $Var(U|H_0) = \frac{mn(m+n+1)}{12}$.


Figure 4: Degree of separation and the value of U


What we provide in this module 


Direct derivation of Variance of U 

Different tests: Small sample and large sample 
Modification for tied and ordinal data 
Consistency 


Confidence interval for median using Mann-Whitney cut offs 


1 Variance of U: Direct approach 


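We have already discussed the derivation of the variance of U using the results of U statistics.
But such a variance can also be developed from first principles. Define $D_{ij} = I(X_i < Y_j)$,
so that $U = \sum_{i=1}^{n}\sum_{j=1}^{m} D_{ij}$. The $D_{ij}$ are not all independent and hence

$$Var(U) = \sum_{i=1}^{n}\sum_{j=1}^{m} Var(D_{ij})
  + \sum_{i=1}^{n}\ \sum\sum_{1 \le j \ne k \le m} Cov(D_{ij}, D_{ik})
  + \sum_{j=1}^{m}\ \sum\sum_{1 \le i \ne k \le n} Cov(D_{ij}, D_{kj})
  + \sum\sum_{1 \le i \ne g \le n}\ \sum\sum_{1 \le j \ne k \le m} Cov(D_{ij}, D_{gk}).$$

Evaluating each term, the final expression can be obtained. First of all note that
$Var(D_{ij}) = p(1 - p)$, where $p = P(X_1 < Y_1)$. Then

$$Cov(D_{ij}, D_{ik}) = P(X_i < Y_j, X_i < Y_k) - p^2 = \int \{1 - G(x)\}^2 dF(x) - p^2 = p_1 - p^2,$$

$$Cov(D_{ij}, D_{kj}) = P(Y_j > X_i, Y_j > X_k) - p^2 = \int F^2(y)\, dG(y) - p^2 = p_2 - p^2$$

and

$$Cov(D_{ij}, D_{gk}) = P(Y_j > X_i, Y_k > X_g) - p^2 = p^2 - p^2 = 0.$$

Now we have mn variance terms, nm(m-1) covariance terms of the first kind and mn(n-1)
covariance terms of the second kind. Combining all these we get the final expression

$$Var(U) = mn\{p - (m + n - 1)p^2 + (m - 1)p_1 + (n - 1)p_2\}.$$

However, under F = G, $p = \frac{1}{2}$, $p_1 = p_2 = \frac{1}{3}$ and hence we finally get
$Var(U|H_0) = \frac{mn(m+n+1)}{12}$.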
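
The null variance can be checked numerically for small samples by listing every equally likely arrangement. The sketch below is only an illustration under that brute-force assumption; the function names are hypothetical, and exact fractions are used so the comparison is not affected by rounding.

```python
from fractions import Fraction
from itertools import combinations


def exact_null_variance(n: int, m: int) -> Fraction:
    """Variance of U over all equally likely positions of the n X's among n + m ranks."""
    total_ranks = n + m
    values = []
    for x_positions in combinations(range(total_ranks), n):
        x_set = set(x_positions)
        # U = number of (X, Y) pairs in which the X is ranked below the Y.
        u = sum(1 for y in range(total_ranks) if y not in x_set
                for x in x_positions if x < y)
        values.append(u)
    mean = Fraction(sum(values), len(values))
    return sum((Fraction(v) - mean) ** 2 for v in values) / len(values)


def formula_null_variance(n: int, m: int) -> Fraction:
    """mn(m + n + 1) / 12."""
    return Fraction(m * n * (m + n + 1), 12)


for n, m in [(2, 2), (3, 4), (4, 3)]:
    print(n, m, exact_null_variance(n, m), formula_null_variance(n, m))
```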


2 Different Tests 


2.1 Small sample test 


From the intuitive justification, we have the following tests. For the alternative
$H_a : \theta > 0$, a size $\alpha$ test can be expressed as
$\phi = I(U > U_\alpha) + a\,I(U = U_\alpha)$, where $U_\alpha$ and $a$ are such that
$E_{H_0}\phi = \alpha$. For $H_a : \theta < 0$, a size $\alpha$ test can be expressed as
$\phi = I(U < U_{1-\alpha}) + a\,I(U = U_{1-\alpha})$, where $U_{1-\alpha}$ and $a$ are such that
$E_{H_0}\phi = \alpha$. However the distribution of U is symmetric about $\frac{mn}{2}$ under the
null hypothesis and hence a size $\alpha$ test for the two sided alternative
$H_a : \theta \ne 0$ can be expressed as
$\phi = I(|U - \frac{mn}{2}| > U_\alpha) + a\,I(|U - \frac{mn}{2}| = U_\alpha)$ with
$E_{H_0}\phi = \alpha$, where the cut off $U_\alpha$ now refers to the null distribution of
$|U - \frac{mn}{2}|$.

One can also report the p values of the above tests to judge the strength of the observed data.
If $U_{obs}$ is the observed value of U, the concerned one sided p values are either
$P_{H_0}(U \ge U_{obs})$ or $P_{H_0}(U \le U_{obs})$. The two sided p value is
$2\min\{P_{H_0}(U \ge U_{obs}), P_{H_0}(U \le U_{obs})\}$. In any case, the null hypothesis is
rejected if this p value does not exceed $\alpha$.


2.2 Large sample distribution & test 


Using the results of U statistic, under the null hypothesis,
$U^* = \frac{U - mn/2}{\sqrt{mn(m+n+1)/12}}$ is asymptotically N(0,1). Asymptotic normality holds
even for moderate (n,m) and for better understanding, we have plotted the mass function of U
for various (n,m) and superimposed the corresponding normal density in the following figure.

Since the asymptotic normality of $U^*$ holds even for moderate sample sizes, different tests
can be performed in large samples using $U^*$. Consider testing $H_0 : \theta = 0$ against
$H_a : \theta > 0$. Then the corresponding large sample test is non-randomized and rejects the
null hypothesis if the observed value of $U^*$ exceeds $\tau_\alpha$, the upper $\alpha$ point
of the N(0,1) distribution. Similarly, the large sample tests for the other hypotheses can also
be constructed.

Figure 1: Normal Approximation for small (n,m) (panel shown: n=3, m=4)


2.3 Continuity Correction 


The large sample distribution of standardised U gives an approximation of a discrete
distribution by a continuous distribution. That is, we are approximating the area of a
histogram by the area under a continuous curve. Thus, in general, $P_{H_0}(U \le x)$ will be
underestimated by the approximation $\Phi\left(\frac{x - mn/2}{\sqrt{mn(m+n+1)/12}}\right)$.
So we need a modification, commonly known as the continuity correction, which uses the fact
that, for integer x,

$$P_{H_0}(U \le x) = P_{H_0}(U \le x + 1/2) \approx \Phi\left(\frac{x - mn/2 + 1/2}{\sqrt{mn(m+n+1)/12}}\right).$$

We explain the improvement by a relevant figure. For illustration, we have taken m=4, n=3
and plotted the corresponding DF of U (i.e. the step function). We have also plotted the
normal DF without continuity correction (the continuous curve) and the normal DF with
continuity correction (the dotted thick curve). Naturally, the corrected curve lies above the
uncorrected one and covers more of the region under the step function. Thus, a better
approximation is expected after correction for continuity.

Figure 2: Continuity Correction with n=4,m=3

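
The effect of the correction can also be seen numerically. The following is a small sketch, assuming the null recursion for the number of arrangements and the variance mn(m+n+1)/12; the helper names are illustrative only.

```python
from functools import lru_cache
from math import comb, sqrt
from statistics import NormalDist


@lru_cache(maxsize=None)
def arrangement_counts(n: int, m: int) -> tuple:
    """Counts N_{n,m}(u), u = 0..mn, from N_{n,m}(u) = N_{n-1,m}(u) + N_{n,m-1}(u-n)."""
    if n == 0 or m == 0:
        return (1,)
    out = [0] * (n * m + 1)
    for u, c in enumerate(arrangement_counts(n - 1, m)):  # largest value is an X
        out[u] += c
    for u, c in enumerate(arrangement_counts(n, m - 1)):  # largest value is a Y
        out[u + n] += c
    return tuple(out)


def exact_cdf(x: int, n: int, m: int) -> float:
    """Exact P(U <= x) under F = G."""
    return sum(arrangement_counts(n, m)[: x + 1]) / comb(m + n, n)


def normal_cdf(x: int, n: int, m: int, corrected: bool) -> float:
    """Normal approximation to P(U <= x), with or without the 1/2 correction."""
    sd = sqrt(m * n * (m + n + 1) / 12)
    shift = 0.5 if corrected else 0.0
    return NormalDist().cdf((x - m * n / 2 + shift) / sd)


n, m = 3, 4
for x in range(m * n + 1):
    print(x, round(exact_cdf(x, n, m), 4),
          round(normal_cdf(x, n, m, False), 4),
          round(normal_cdf(x, n, m, True), 4))
```

For n = 3, m = 4 the corrected column is closer to the exact DF than the uncorrected one at every point of the support.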
3 Modification for tied observations 


We have already assumed continuity of F and G. But in practice ties can occur within either
of the samples. However, in such a case we still get a unique value of U. But if one or more
X observations are tied with one or more Y observations, the given definition of U is not
applicable. In such a case we redefine $D_{ij}$ as

$$D_{ij} = 1 \text{ if } X_i < Y_j, \qquad D_{ij} = 0 \text{ if } X_i = Y_j, \qquad D_{ij} = -1 \text{ if } X_i > Y_j.$$

Then the Mann-Whitney U reduces to $U_T = \sum_{i=1}^{n}\sum_{j=1}^{m} D_{ij}$. If we define
$p^+ = P(D_{ij} = 1)$, $p^- = P(D_{ij} = -1)$ and $p_0 = P(D_{ij} = 0)$, then
$E(D_{ij}) = p^+ - p^-$ so that $E(U_T) = mn(p^+ - p^-)$. The null hypothesis of interest in
this case is $H_0 : p^+ = p^-$. Now, $Var(D_{ij}) = 1 - p_0 - (p^+ - p^-)^2$ and as earlier,
$Var(U_T)$ can be obtained. However, the variance of $U_T$ conditional upon the observed ties
under $H_0$ can be calculated as

$$Var(U_T|t) = \frac{mn(N+1)}{3}\left(1 - \frac{\sum_i t_i(t_i^2 - 1)}{N(N^2 - 1)}\right),$$

where N = m + n, and $t_i$ is the multiplicity of the i th tie. Naturally $t_i = 1\ \forall i$
in the non tied situation. Then for moderate N, large sample normality of the statistic
$\frac{U_T}{\sqrt{Var_{H_0}(U_T|t)}}$ holds and hence large sample tests can be performed.


3.1 Mann-Whitney test for ordinal data 


The adjustment for ties makes the Mann-Whitney test applicable to ordinal data. We
start with an example. Consider the following data on grades obtained by a group of patients
assigned Treatment 1 and Treatment 2 (A being the highest).

    Trt 1 | A  B  C
    Trt 2 | B  C

The objective is naturally to compare the performance of the two treatments. However, the
responses are tied as there are 2 observations with grade B and 2 observations with grade C.
Taking the Treatment 2 grades as the X's and the Treatment 1 grades as the Y's, we calculate
the 3 × 2 = 6 values of $D_{ij}$ in the following matrix:

             Y_j:   C    B    A
    X_i = C         0    1    1
    X_i = B        -1    0    1

The sum of the elements (i.e. 2) is the value of $U_T$. Also the ties are of length $t_B = 2$
and $t_C = 2$. Thus the modified statistic can be calculated for the purpose.
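
The calculation for this example can be sketched as follows. The numeric coding of the grades and the function name are assumptions made only for illustration, and the variance is the usual textbook tie-corrected form, conditional on the observed ties.

```python
from collections import Counter
from math import sqrt

# Hypothetical numeric coding of the ordinal grades; A is the highest.
GRADE_SCORE = {"C": 1, "B": 2, "A": 3}


def sign(v: int) -> int:
    return (v > 0) - (v < 0)


def tied_mann_whitney(x: list, y: list) -> tuple:
    """Return (U_T, Var(U_T | ties)) with D_ij = sign(Y_j - X_i)."""
    u_t = sum(sign(yj - xi) for xi in x for yj in y)
    n, m = len(x), len(y)
    total = n + m
    tie_lengths = Counter(x + y).values()  # untied values have length 1
    correction = sum(t * (t * t - 1) for t in tie_lengths) / (total * (total ** 2 - 1))
    variance = m * n * (total + 1) / 3 * (1 - correction)
    return u_t, variance


x = [GRADE_SCORE[g] for g in "BC"]   # Treatment 2
y = [GRADE_SCORE[g] for g in "ABC"]  # Treatment 1
u_t, variance = tied_mann_whitney(x, y)
print(u_t, variance, u_t / sqrt(variance))
```

With only five observations the normal approximation is of course not reliable; the example only shows the arithmetic.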


4 Consistency of Mann-Whitney Test 


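Consider testing $H_0 : \theta = 0$ against $H_a : \theta > 0$. Now the test can be
equivalently expressed in terms of $T = \frac{U}{mn}$. Then from the convergence of U statistic

$$T \stackrel{P}{\to} P(X_1 < Y_1).$$

If we define $Y_j^0 = Y_j - \theta$, then

$$Y_j \sim F(x - \theta) \Leftrightarrow Y_j^0 \sim F(x).$$

Thus we get $P(X_1 < Y_1) = P(X_1 < Y_1^0 + \theta)$. Now, if we define
$\mu(\theta) = P(X_1 < Y_1^0 + \theta)$ then $\mu(0) = \frac{1}{2}$ is the value of
$\mu(\theta)$ under F = G. Hence

$$\mu(\theta) = \frac{1}{2} \text{ if } \theta = 0, \qquad \mu(\theta) > \frac{1}{2} \text{ if } \theta > 0.$$

Thus the Mann-Whitney Test is consistent against the alternative $H_a : \theta > 0$.
Consistency against the other alternatives can also be proved.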


5 Confidence interval for θ


We can use the cut off of the Mann-Whitney test to get a confidence interval for $\theta$ with
confidence coefficient at least $(1 - \alpha)$. Consider the acceptance region of the non
randomized level $\alpha$ Mann-Whitney test for $H_0 : \theta = 0$ against
$H_a : \theta \ne 0$. Note that under any $\theta$, $X_i$, i = 1,2,..,n and $Y_j - \theta$,
j = 1,2,..,m are iid observations from F. Now define

$$U(\theta) = \sum_{i=1}^{n}\sum_{j=1}^{m} I(Y_j - X_i > \theta).$$

Then the Mann-Whitney test statistic is U(0). Now, due to symmetry, the acceptance region
can be expressed as $c \le U(0) \le mn - c$, where c is such that
$P_{H_0}(c \le U(0) \le mn - c) \ge 1 - \alpha$. Let us arrange the mn differences $Y_j - X_i$
in increasing order and denote them by $D_{(1)} < D_{(2)} < \dots < D_{(mn)}$. Then

$$U(\theta) \le mn - c \Leftrightarrow \theta \ge D_{(c)}.$$

Again

$$U(\theta) \ge c \Leftrightarrow \theta < D_{(mn-c+1)}.$$

Thus $c \le U(\theta) \le mn - c$ is equivalent to $D_{(c)} \le \theta < D_{(mn-c+1)}$. Then
$P_\theta\{[D_{(c)}, D_{(mn-c+1)}) \ni \theta\} = P_\theta(c \le U(\theta) \le mn - c)$. Since
the distribution of $U(\theta)$ under $\theta$ is the same as that of U(0) under $H_0$, we have

$$P_\theta(c \le U(\theta) \le mn - c) = P_{H_0}(c \le U(0) \le mn - c).$$

Since the test is of level $\alpha$, the RHS probability in the above is at least
$1 - \alpha$. Thus the coverage probability of the random interval
$[D_{(c)}, D_{(mn-c+1)})$ is at least $1 - \alpha$. Hence $[D_{(c)}, D_{(mn-c+1)})$ is a
confidence interval for $\theta$ with confidence probability at least $1 - \alpha$.
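
The construction can be sketched in a few lines. The code below is an illustration under the assumptions of this section (no ties, exact null distribution from the recursion for the number of arrangements); the function names and the sample data are hypothetical.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def arrangement_counts(n: int, m: int) -> tuple:
    """Counts N_{n,m}(u), u = 0..mn, under F = G."""
    if n == 0 or m == 0:
        return (1,)
    out = [0] * (n * m + 1)
    for u, c in enumerate(arrangement_counts(n - 1, m)):
        out[u] += c
    for u, c in enumerate(arrangement_counts(n, m - 1)):
        out[u + n] += c
    return tuple(out)


def shift_confidence_interval(x: list, y: list, alpha: float = 0.05) -> tuple:
    """Return (D_(c), D_(mn-c+1)) for the largest c with P(c <= U <= mn - c) >= 1 - alpha."""
    n, m = len(x), len(y)
    counts = arrangement_counts(n, m)
    total = comb(n + m, n)
    best_c = None
    for c in range(1, m * n // 2 + 1):
        coverage = sum(counts[c: m * n - c + 1]) / total
        if coverage >= 1 - alpha:
            best_c = c  # coverage decreases in c, so the last hit is the largest c
    if best_c is None:
        raise ValueError("sample sizes too small for the requested confidence level")
    differences = sorted(yj - xi for xi in x for yj in y)
    return differences[best_c - 1], differences[m * n - best_c]


# Hypothetical data, integer valued so that the differences are exact.
x_sample = [11, 23, 30, 42]
y_sample = [20, 39, 51, 64]
print(shift_confidence_interval(x_sample, y_sample, alpha=0.1))
```

Here the interval is closed on the left and open on the right, as in the derivation above.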


6 Determination of Sample size 


Consider testing $H_0 : \theta = 0$ against $H_a : \theta = \theta_1 (\ne 0)$ for specified
$\theta_1$. The experimenter should design the experiment in such a way that the difference
$\theta_1$ can be identified with high probability but with the minimum sample size. The usual
technique is to determine (n, m) in such a way that the test has size $\alpha$ and power
$1 - \beta$ at the alternative, where $\alpha$ and $\beta$ are specified in advance. However,
determination of the exact sample size is not easy in practice. We can use the large sample
normal approximations to get a tractable formula for the sample size.

Consider testing $H_0 : \theta = 0$ against $H_1 : \theta = \theta_1 (> 0)$ for specified
$\theta_1$. Then following Noether(1987), the approximate total sample size N can be expressed as

$$N \approx \frac{(\tau_\alpha + \tau_\beta)^2}{12\lambda(1 - \lambda)(p - .5)^2},$$

where $p = P(Y_1 > X_1)$, $\lambda$ is the fraction of the N observations allocated to one of
the two samples, and $\tau_\alpha$, $\tau_\beta$ are the upper $\alpha$ and $\beta$ points of
the N(0,1) distribution.

Naturally, p depends on the unknown F. Now, for the purpose of comparison, we consider three
distributions, namely, Normal, Cauchy and Laplace. For the first population, we consider
N(0,1), Cauchy(0,1) and Laplace(0,1) distributions and for the second population, we take
the same distributions but with location parameter $\theta$. Then we consider testing
$H_0 : \theta = 0$ against $H_1 : \theta > 0$. We have computed the sample size for various
choices of $\theta (> 0)$, by the derived formula taking $\alpha = .05$ and $\beta = .2$. The
nature of the approximate sample size is plotted in the following figure.

Figure 3: Approximate sample size for different F

Thus we observe that the Normal distribution takes the lowest number of samples to reach
the desired power level among the candidates and the Cauchy distribution takes the highest
number of observations to reach 80% power. In addition, as we increase $\theta$, the required
sample size decreases for each candidate.
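
The comparison can be reproduced approximately with a short script. This is only a sketch: the closed forms used for p = P(Y_1 > X_1) under a location shift are my own working for unit-scale Normal, Cauchy and Laplace distributions and are not taken from the module, and equal allocation (lam = 0.5) is assumed.

```python
from math import atan, exp, pi, sqrt
from statistics import NormalDist


def noether_total_sample_size(p: float, alpha: float = 0.05, beta: float = 0.2,
                              lam: float = 0.5) -> float:
    """Approximate total N = (tau_alpha + tau_beta)^2 / (12 lam (1 - lam) (p - 1/2)^2)."""
    z = NormalDist()
    tau_alpha = z.inv_cdf(1 - alpha)
    tau_beta = z.inv_cdf(1 - beta)
    return (tau_alpha + tau_beta) ** 2 / (12 * lam * (1 - lam) * (p - 0.5) ** 2)


# p = P(Y > X) when Y is distributed as X + theta (assumed closed forms, unit scale).
def p_normal(theta: float) -> float:
    return NormalDist().cdf(theta / sqrt(2))


def p_cauchy(theta: float) -> float:
    return 0.5 + atan(theta / 2) / pi


def p_laplace(theta: float) -> float:
    return 1 - 0.25 * (2 + theta) * exp(-theta)


for theta in (0.5, 1.0, 1.5):
    print(theta,
          round(noether_total_sample_size(p_normal(theta)), 1),
          round(noether_total_sample_size(p_laplace(theta)), 1),
          round(noether_total_sample_size(p_cauchy(theta)), 1))
```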


Nonparametric Inference: Module 15


What we provide in this module


Wald-Wolfowitz run test with properties


Different tests based on runs


Kolmogorov-Smirnov two sample test based on ECDF


Test for randomness


1 Co-Ordinator: Dr. Rahul Bhattacharya, Department of Statistics, Calcutta University


1 Test based on arrangements: The idea 


Suppose $X_i$, i = 1,2,..,n and $Y_j$, j = 1,2,..,m are independent observations from
distributions F and G respectively and our hypothesis of interest is expressed as
$H_0 : F(x) = G(x)$ for all x. The most general two-sided alternative is
$H_a : F(x) \ne G(x)$ for some x. For the test against the general alternative, we can use the
arrangement pattern of the observations to derive a meaningful test. Under the null hypothesis
the two samples can be regarded as a single sample of size N = m + n from a continuous but
unspecified population. Then we have $\binom{m+n}{n}$ equally likely configurations of n X's
and m Y's. The arrangement pattern gives information about the closeness of the underlying
populations. For example, if the observations are well mixed, then we could think of identical
populations. However, if there is a pattern in mixing, a deviation from the null is natural.
Many statistical tests are based on the pattern of mixing in the combined arrangement of
N observations.


2 Wald-Wolfowitz Runs test 


This test is based on the theory of runs. A run is defined as a sequence of similar symbols
preceded and followed by a different symbol or by no symbol at all. For example consider 4
x's and 3 y's arranged in increasing order of magnitude in the following way:

    x y y x x y x

Then we have a run of one x, then a run of 2 y's, then a run of 2 x's, a run of one y and
lastly a run of a single x. The run length of the first run is unity, that of the second is 2 and
so on. As another example consider 3 x's and 3 y's arranged in increasing order of magnitude
in the following way:

    x y x y x y

Then we have 6 runs each of length unity.
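
Counting runs is easy to mechanise. The following is a minimal sketch in Python using the standard library's `itertools.groupby`; the function name `runs` is an illustrative choice.

```python
from itertools import groupby


def runs(sequence: str) -> list:
    """Return the runs of a symbol sequence as (symbol, run length) pairs."""
    return [(symbol, len(list(group))) for symbol, group in groupby(sequence)]


first = runs("xyyxxyx")
second = runs("xyxyxy")
print(first, len(first))
print(second, len(second))
```

The total number of runs is simply the length of the returned list.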


The runs test is based on R, the total number of runs in the combined arrangement of n X's
and m Y's. If the two samples are from identical populations, the observations are
expected to be well mixed. Thus R is expected to be large. On the other hand if
the underlying populations are different, observations from different populations tend to be
separated. Hence R is expected to be small in such a case. Thus small values of R are
indicative of different populations.

However R is only a measure of deviation from $F(x) = G(x)\ \forall x$ and it does not
differentiate between $Y \stackrel{st}{>} X$ and $X \stackrel{st}{>} Y$. For example consider
a configuration of the following type:

    x x x x y y y y

Then R=2 but all the X observations are less than the Y observations. This might indicate
$Y \stackrel{st}{>} X$. Consider another configuration:

    y y y y x x x x

Then again R=2 but the configuration indicates the possibility of $X \stackrel{st}{>} Y$. Thus
the runs test is appropriate for the two-sided alternative $H_a : F(x) \ne G(x)$ for some x.


2.1 Distribution of R 


Assume that $F(x) = G(x)\ \forall x$. Suppose the X observations are indicated by x's and the Y
observations are indicated by y's. Under the null hypothesis we have $\binom{m+n}{n}$ equally
likely configurations of n x's and m y's. Suppose $R_1$ is the number of runs of x's and $R_2$
is the number of runs of y's. We shall first obtain the joint distribution of $(R_1, R_2)$.
Suppose $(r_1, r_2)$ is the observed value of $(R_1, R_2)$. Then note that $r_1 = 1,2,..,n$ and
$r_2 = 1,2,..,m$ but $r_1$ and $r_2$ can not differ by more than unity, because otherwise two
runs of the same symbol would have to be adjacent, which contradicts the definition of runs.
If $r_1 = r_2 - 1$, the arrangement starts with a run of y's and for $r_2 = r_1 - 1$, the
arrangement starts with a run of x's. However, if $r_1 = r_2$, then the arrangement starts
with either a run of x's or a run of y's. Thus, we define

$$c(r_1, r_2) = 0 \text{ if } |r_1 - r_2| \ge 2, \qquad = 1 \text{ if } |r_1 - r_2| = 1, \qquad = 2 \text{ if } |r_1 - r_2| = 0.$$

Now under the null hypothesis, each of the $\binom{m+n}{n}$ arrangements of n x's and m y's is
equally likely. Now to get $r_1$ runs of x's we need to put $r_1 - 1$ bars in the available n-1
positions. And for each such choice, if we put $r_2 - 1$ bars in the available m-1 positions,
we get $r_2$ runs of y's. Then the total number of ways of getting $r_1$ runs of x's and $r_2$
runs of y's is $c(r_1, r_2)\binom{n-1}{r_1-1}\binom{m-1}{r_2-1}$. Then we have the joint pmf

$$P(R_1 = r_1, R_2 = r_2) = \frac{c(r_1, r_2)\binom{n-1}{r_1-1}\binom{m-1}{r_2-1}}{\binom{m+n}{n}}$$

for $r_1 = 1,2,..,n$ and $r_2 = 1,2,..,m$.

Next we shall obtain the marginal distributions. By definition

$$P(R_1 = r_1) = \sum_{r_2} P(R_1 = r_1, R_2 = r_2)
  = \frac{\binom{n-1}{r_1-1}\left[\binom{m-1}{r_1-2} + 2\binom{m-1}{r_1-1} + \binom{m-1}{r_1}\right]}{\binom{m+n}{n}}$$

and a simple algebra gives

$$P(R_1 = r_1) = \frac{\binom{n-1}{r_1-1}\binom{m+1}{r_1}}{\binom{m+n}{n}}, \quad r_1 = 1,2,..,n.$$

In a similar way

$$P(R_2 = r_2) = \frac{\binom{m-1}{r_2-1}\binom{n+1}{r_2}}{\binom{m+n}{n}}, \quad r_2 = 1,2,..,m.$$

Now to obtain the distribution of $R = R_1 + R_2$, the total number of runs, we note that

$$P(R_1 + R_2 = r) = \sum_{r_1} P(R_1 = r_1, R_2 = r - r_1).$$

Now for even r, the only possible value of $r_1$ is r/2. Thus

$$P(R_1 + R_2 = r) = \frac{2\binom{n-1}{r/2-1}\binom{m-1}{r/2-1}}{\binom{m+n}{n}}, \quad r = 2, 4, \dots$$

However, for odd r, either $r_1 = (r - 1)/2$ or $r_1 = (r + 1)/2$. Then we get

$$P(R_1 + R_2 = r) = \frac{\binom{n-1}{(r-1)/2}\binom{m-1}{(r-3)/2} + \binom{n-1}{(r-3)/2}\binom{m-1}{(r-1)/2}}{\binom{m+n}{n}}, \quad r = 3, 5, \dots$$
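
The pmf of R can be checked against a direct enumeration of all arrangements for small n and m. The sketch below is illustrative only (function names are hypothetical) and uses exact fractions so that the two computations can be compared for equality.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations, groupby
from math import comb


def run_count_pmf(n: int, m: int) -> dict:
    """P(R = r) under F = G for n x's and m y's (n, m >= 1), from the closed form."""
    total = comb(m + n, n)
    pmf = {}
    for r in range(2, m + n + 1):
        if r % 2 == 0:
            k = r // 2
            ways = 2 * comb(n - 1, k - 1) * comb(m - 1, k - 1)
        else:
            k = (r - 1) // 2  # r = 2k + 1, so (r - 3) / 2 = k - 1
            ways = (comb(n - 1, k) * comb(m - 1, k - 1)
                    + comb(n - 1, k - 1) * comb(m - 1, k))
        if ways:
            pmf[r] = Fraction(ways, total)
    return pmf


def run_count_pmf_bruteforce(n: int, m: int) -> dict:
    """Same pmf, obtained by listing every placement of the x's."""
    tally = Counter()
    for x_positions in combinations(range(n + m), n):
        x_set = set(x_positions)
        sequence = ["x" if i in x_set else "y" for i in range(n + m)]
        tally[sum(1 for _ in groupby(sequence))] += 1
    total = sum(tally.values())
    return {r: Fraction(c, total) for r, c in tally.items()}


print(run_count_pmf(4, 3))
```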


2.2 Moments of (RH, Rə) 


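We start from the following lemma. Lemma: For any a, b and k

$$\sum_i \binom{a}{i}\binom{b}{k+i} = \binom{a+b}{a+k}.$$

Proof: It is easy to observe that $\binom{a}{i}$ = coefficient of $t^{-i}$ in
$(1 + \frac{1}{t})^a$ and $\binom{b}{k+i}$ = coefficient of $t^{k+i}$ in $(1 + t)^b$. Thus we get

$$\sum_i \binom{a}{i}\binom{b}{k+i} = \text{coefficient of } t^k \text{ in } \left(1 + \frac{1}{t}\right)^a (1 + t)^b
  = \text{coefficient of } t^{a+k} \text{ in } (1 + t)^{a+b} = \binom{a+b}{a+k}.$$

This completes the proof.

Now assume that m > n. Then with the notation $r^{(g)} = \frac{r!}{(r-g)!}$, $r \ge g$, we get
the factorial moment

$$E R_1^{(g)} = \sum_{r_1} r_1^{(g)} P(R_1 = r_1)
  = \frac{(m+1)^{(g)}}{\binom{m+n}{n}} \sum_{r_1} \binom{n-1}{r_1-1}\binom{m+1-g}{r_1-g}
  = \frac{(m+1)^{(g)}\, n^{(g)}}{(m+n)^{(g)}}$$

using the above lemma. In a similar way we get

$$E R_2^{(g)} = \frac{(n+1)^{(g)}\, m^{(g)}}{(m+n)^{(g)}}.$$

Now to obtain the expectation of $R_1$, we take g = 1 and get $E(R_1) = \frac{n(m+1)}{m+n}$.
In a similar way, we get $E(R_2) = \frac{m(n+1)}{m+n}$. Hence we get
$E(R) = E(R_1) + E(R_2) = 1 + \frac{2mn}{m+n}$.

Again taking g = 2, we get $E R_1(R_1 - 1) = \frac{(m+1)^{(2)} n^{(2)}}{(m+n)^{(2)}}$ and hence
we have

$$Var(R_1) = E R_1(R_1 - 1) - E(R_1)(E(R_1) - 1) = \frac{(m+1)^{(2)}\, n^{(2)}}{(m+n)(m+n)^{(2)}}.$$

In a similar way we get $Var(R_2) = \frac{(n+1)^{(2)}\, m^{(2)}}{(m+n)(m+n)^{(2)}}$.

To obtain the variance of R, we need to compute the covariance between $R_1$ and $R_2$ and
hence we start from the joint factorial moment, defined by

$$\mu_{(g,g)} = E\,(R_1 - 1)^{(g)} (R_2 - 1)^{(g)}.$$

Now by definition,

$$\mu_{(g,g)} = \sum_{r_1, r_2 \ge g+1} \frac{(r_1 - 1)!}{(r_1 - 1 - g)!}\,\frac{(r_2 - 1)!}{(r_2 - 1 - g)!}\, p(r_1, r_2),$$

where $p(r_1, r_2)$ is the joint pmf of $(R_1, R_2)$. Then, again using the lemma we get

$$\mu_{(g,g)} = (n-1)^{(g)} (m-1)^{(g)}\, \frac{\binom{\tilde m + \tilde n}{\tilde n}}{\binom{m+n}{n}},$$

where $\tilde r_1 = r_1 - g$, $\tilde r_2 = r_2 - g$, $\tilde n = n - g$ and $\tilde m = m - g$.

Now putting g = 1, we get
$E(R_1 - 1)(R_2 - 1) = \frac{(n-1)(m-1)\binom{m+n-2}{n-1}}{\binom{m+n}{n}}$ and hence using the
relation $Var(R) = Var(R_1) + Var(R_2) + 2Cov(R_1, R_2)$, we get that
$Var(R) = \frac{2mn(2mn - m - n)}{(m+n)^2(m+n-1)}$. For further details, we refer the reader to
the book by Wilks(1962).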


2.3 Different tests: small sample & large sample 


From the intuitive justification, we reject the null hypothesis if $R \le c$ for some c
satisfying $P_{H_0}(R \le c) \le \alpha$. The test can also be performed in large samples. Now
if we assume $\frac{n}{m+n} \to \lambda \in (0,1)$, then we have the following approximations:
$E\left(\frac{R}{m+n}\right) \approx 2\lambda(1 - \lambda)$ and
$Var\left(\frac{R}{\sqrt{m+n}}\right) \approx 4\lambda^2(1 - \lambda)^2$. Then under the null
hypothesis, if $\min(m, n) \to \infty$ such that $\frac{n}{m+n} \to \lambda \in (0, 1)$, then

$$R^* = \frac{R - 2N\lambda(1 - \lambda)}{2\lambda(1 - \lambda)\sqrt{N}} \stackrel{d}{\to} N(0, 1),$$

where N = m+n. Since the large sample distribution of $R^*$ is standard normal, a large
sample test rejects the null hypothesis if $R^* < -\tau_\alpha$.
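
The large sample runs test can be sketched as follows. In this illustration the limiting fraction is replaced by its sample value n/N, which is an assumption made here for the purpose of computation; the function name is hypothetical.

```python
from itertools import groupby
from math import sqrt
from statistics import NormalDist


def large_sample_runs_test(sequence: str) -> tuple:
    """Return (R, R*, one sided p value) for a sequence made of two distinct symbols."""
    symbols = sorted(set(sequence))
    if len(symbols) != 2:
        raise ValueError("the sequence must contain exactly two distinct symbols")
    total = len(sequence)
    n = sum(1 for s in sequence if s == symbols[0])
    lam = n / total  # plug-in value for lambda (assumption of this sketch)
    r = sum(1 for _ in groupby(sequence))
    r_star = (r - 2 * total * lam * (1 - lam)) / (2 * lam * (1 - lam) * sqrt(total))
    p_value = NormalDist().cdf(r_star)  # small R is evidence against F = G
    return r, r_star, p_value


print(large_sample_runs_test("x" * 20 + "y" * 20))  # completely separated samples
print(large_sample_runs_test("xy" * 20))            # perfectly alternating samples
```

The first call gives a very small p value and the second a large one, in line with the rule of rejecting for small R.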


3 The Homogeneity Problem 


We have already observed that the Run test is insensitive to stochastic alternatives, like
$H_a : F(x) \ge G(x)$ for all x with strict inequality for some x (i.e. $Y \stackrel{st}{\ge} X$)
or $H_a : F(x) \le G(x)$ for all x with strict inequality for some x (i.e.
$X \stackrel{st}{\ge} Y$). If we can measure the discrepancy between F(x) and G(x), tests for
one and two sided alternatives can be derived. From the theory of U statistic, the empirical
DF's $F_n(x)$ and $G_m(x)$ are the U statistics for the estimation of the DF's. If F = G, there
should be a good agreement between $F_n(x)$ and $G_m(x)$ for all values of x. Some overall
measure of discrepancy between these functions is then a natural test statistic. Then just as
in Goodness-of-fit problems, we use the statistics $\sup_x (F_n(x) - G_m(x))$ or
$\sup_x (G_m(x) - F_n(x))$. These statistics are nothing but the Kolmogorov-Smirnov(KS)
distance function measuring the discrepancy between two continuous DF's.


4 KS two sample test 


The Kolmogorov-Smirnov(KS) two sample test of homogeneity is based on the KS distance
function, defined above. Consider the alternative $H_a : F(x) \ge G(x)$ for all x. Then our
statistic is $D^+_{n,m} = \sup_x (F_n(x) - G_m(x))$ and we reject $H_0$ if $D^+_{n,m}$ is too
large. For the alternative $H_a : F(x) \le G(x)$ for all x, we use
$D^-_{n,m} = \sup_x (G_m(x) - F_n(x))$ and we reject $H_0$ if $D^-_{n,m}$ is too large. However,
for the alternative $H_a : F(x) \ne G(x)$ for some x, we use

$$D_{n,m} = \max(D^+_{n,m}, D^-_{n,m}) = \sup_x |F_n(x) - G_m(x)|$$

and we reject $H_0$ if $D_{n,m}$ is too large.
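
The three statistics are easy to compute because the difference of the two empirical DF's is a step function that changes only at the observed points. The following is a minimal sketch using the standard library's `bisect`; the function name is illustrative.

```python
from bisect import bisect_right


def ks_two_sample(x: list, y: list) -> tuple:
    """Return (D+, D-, D) computed from the empirical DF's F_n of x and G_m of y."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    d_plus = d_minus = 0.0  # both differences are 0 to the left of all observations
    for t in xs + ys:  # the suprema are attained at observed points
        diff = bisect_right(xs, t) / n - bisect_right(ys, t) / m
        d_plus = max(d_plus, diff)
        d_minus = max(d_minus, -diff)
    return d_plus, d_minus, max(d_plus, d_minus)


print(ks_two_sample([1, 2, 3], [4, 5, 6]))
print(ks_two_sample([1, 3, 5, 7], [2, 4, 6, 8]))
```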


4.1 KS test is nonparametric 


Consider the statistic $D_{n,m} = \sup_x |F_n(x) - G_m(x)|$. Now assume F = G and consider the
new set of data, $F(X_i)$, i = 1,2,..,n and $G(Y_j)$, j = 1,2,..,m. Note that the ordering of
the combined sample, and hence the value of $D_{n,m}$, does not change for the new set of
data. Now the transformed data are observations from a R(0,1) population and hence the
distribution of $D_{n,m}$ under F=G depends only on m and n. Thus $D_{n,m}$, and likewise
$D^+_{n,m}$ and $D^-_{n,m}$, provide exactly nonparametric tests.


4.2 Consistency of KS test 


Note that $F_n$ and $G_m$ are strongly consistent for F and G respectively. Thus $F_n - G_m$
is consistent for F − G and hence $D_{n,m}$, $D^+_{n,m}$ and $D^-_{n,m}$ are all consistent
estimators of the corresponding population quantities. Under F=G, all the population
quantities are zero. On the other hand, under the corresponding alternatives it can be shown
that $D_{n,m}$, $D^+_{n,m}$ and $D^-_{n,m}$ converge in probability to positive values. Thus
the KS test is consistent.


4.3 Large sample test 


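Now it can be shown that

$$\frac{4mn}{m+n}\,(D^+_{n,m})^2 \stackrel{D}{\to} \chi^2_2$$

as $\min(m,n) \to \infty$. Thus a large sample test for $H_0 : F(x) = G(x)\ \forall x$ against
$H_a : F(x) \ge G(x)$ for all x rejects the null hypothesis if

$$\frac{4mn}{m+n}\,(D^+_{n,m})^2 > \chi^2_{2,\alpha},$$

where $\chi^2_{2,\alpha}$ is the upper $\alpha$ point of the $\chi^2_2$ distribution.

Again, it can be shown that

$$\lim P\left(\sqrt{\frac{mn}{m+n}}\, D_{n,m} \le z\right) = H(z) = 1 - 2\sum_{i=1}^{\infty} (-1)^{i-1} e^{-2i^2z^2}.$$

Thus a large sample test for $H_0$ against $H_a : F(x) \ne G(x)$ for some x rejects the null
hypothesis if $\sqrt{\frac{mn}{m+n}}\, D_{n,m} > d_\alpha$, where $d_\alpha$ is a root of
$H(d_\alpha) = 1 - \alpha$.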


5 Test for randomness 


The most important application of the runs test is to provide a distribution free test of
randomness. We start with a real life example. Suppose the students entered a lecture theatre
in a certain order. Then it is of interest to know whether students of the same gender come in
together. If we denote male students by M and female students by F, then we get a sequence
of symbols M and F. Naturally, well mixed symbols tend to contradict the claim that students of
the same gender come in together, and are what we expect if they enter the lecture theatre in a
random fashion. Any pattern in the symbols possibly indicates lack of randomness. Thus our null
hypothesis in this case is randomness and the alternative is naturally lack of randomness. The
arrangement of M and F gives a number of runs of these symbols. If the number of runs is
too small or too large, some pattern is suspected. This contradicts the random pattern and
hence we reject the null hypothesis in such a situation.

Formally if $X_i$, i = 1,2,..,n are observations from a continuous population F(x) then the
null hypothesis can be expressed as $H_0 : X_1, X_2, .., X_n$ is a random sample. However, the
alternative, that is, lack of randomness, does not specify any particular pattern. Now to
dichotomize the data (if the data are not naturally dichotomised) we use some statistic T based
on all the observations. If we put the symbol L or U according as the observation is lower or
greater than T, we get an arrangement of two symbols. Naturally, we reject the null hypothesis
if the total number of runs is too small or too large. Using the large sample distribution of
runs, a large sample test can also be performed.


Nonparametric Inference: Module 16


What we provide in this module


Concepts of concordance and discordance


Kendall's τ with its estimate


Properties 


Test of trend 


1 Co-Ordinator: Dr. Rahul Bhattacharya, Department of Statistics, Calcutta University 


1 Measure of Association in parametric context 


Suppose $(X_i, Y_i)$, i = 1,2,..,n are iid observations from a continuous bivariate
distribution F(x,y) and the objective is to get some idea about the nature of dependence
between X and Y. The most frequently used measure is the Pearson correlation coefficient,
defined by

$$\rho = \frac{Cov(X, Y)}{\sqrt{Var(X)Var(Y)}}.$$

$\rho$ is a scaled measure of the degree of linear association and the direction of
association. A large positive value (of course less than unity) indicates that large values of
X are paired with large values of Y and vice versa.

Although $\rho$ is well accepted as an association measure, it suffers from a number of
drawbacks. First of all, it is a measure of linear dependence between X and Y, i.e., if the
actual relationship is exactly or at least approximately linear, $\rho$ is a good measure. But
if the actual relationship is far from linear, $\rho$ gives misleading results. For example,
if X ~ N(0,1), then the covariance between X and $Y = X^2$ is zero, that is $\rho = 0$ and
hence there is no linear relationship between X and Y. Then X and Y are uncorrelated,
but there is actually an exact quadratic relationship. In addition, $\rho$ depends on
the existence of the first few moments of the underlying distribution. Lastly, the value of
$\rho$ remains unchanged if the underlying variables are replaced by some linear functions with
positive slope. That is, if the transformation is linear and order preserving then $\rho$ does
not change. However, if the transformation is non linear but order preserving, the value of
$\rho$ may change.


2 Concordance & Discordance 


In order to get invariance under order preserving transformations, we must depend on relative
magnitudes instead of absolute magnitudes and consequently, we introduce the concepts of
concordance and discordance. Consider two pairs of observations (x, y) and (x', y'). These
pairs are concordant (or in agreement) if they vary in the same direction, that is, if x < x'
whenever y < y' or if x > x' whenever y > y'.

Again the pairs (x, y) and (x', y') are said to be discordant (or in disagreement) if they vary
in the opposite direction, that is, if x > x' whenever y < y' or if x < x' whenever y > y'. In
other words, if the line joining these points has a positive slope, the points are concordant
and discordant otherwise. Naturally, the concordance or discordance property is preserved under
an order preserving transformation. The concepts of concordant and discordant pairs are
explained through a figure below.

Figure 1: Concordant & Discordant pairs


2.1 Concordance and Discordance probabilities 


For any two independent pairs $(X_1, Y_1)$ and $(X_2, Y_2)$, the probability of concordance is
defined as

$$\pi_c = P\{[(X_1 > X_2)\ \&\ (Y_1 > Y_2)] \cup [(X_1 < X_2)\ \&\ (Y_1 < Y_2)]\}
       = P\{(X_1 - X_2)(Y_1 - Y_2) > 0\}
       = P\left\{\frac{Y_2 - Y_1}{X_2 - X_1} > 0\right\}.$$

Thus $\pi_c$ can be looked upon as the probability that the straight line joining the random
points $(X_1, Y_1)$ and $(X_2, Y_2)$ has a positive slope. Similarly, the probability of
discordance is defined as

$$\pi_d = P\{(X_1 - X_2)(Y_1 - Y_2) < 0\}.$$

Thus the probabilities of concordance and discordance are invariant under order preserving
transformations and hence a measure of association can be developed by combining
these probabilities.


2.2 Kendall's τ


Since perfect association is either perfect concordance or perfect discordance, a difference
between $\pi_c$ and $\pi_d$ measures the extent of association. The difference
$\pi_c - \pi_d$ is defined as the Kendall's $\tau$ coefficient. If the marginal distributions
are assumed to be continuous then $P(X_1 - X_2 = 0) = 0 = P(Y_1 - Y_2 = 0)$ and hence we get
$\pi_c + \pi_d = 1$. Thus we can express $\tau$ as $\tau = 2\pi_c - 1 = 1 - 2\pi_d$. Naturally,
$\tau$ does not depend on the moments of the underlying distributions and hence is always
defined. Since $\tau$ is the difference of two probabilities, we have $-1 \le \tau \le 1$.

Now we shall explain independence in terms of $\tau$. Under independence, $X_1 - X_2$ and
$Y_1 - Y_2$ are independent and hence, since

$$X_1 - X_2 \stackrel{d}{=} X_2 - X_1 \text{ and } Y_1 - Y_2 \stackrel{d}{=} Y_2 - Y_1,$$

we get under independence

$$\pi_c = P\{(X_1 - X_2)(Y_1 - Y_2) > 0\} = P\{(X_2 - X_1)(Y_1 - Y_2) > 0\}
       = P\{(X_1 - X_2)(Y_1 - Y_2) < 0\} = \pi_d.$$

Hence $\tau = 0$ under independence.

However, the converse is not in general true and hence we get a similarity with $\rho$. The
converse does hold in special cases. For example, consider a bivariate normal population with
correlation coefficient $\rho$. Then

Independence $\Leftrightarrow \tau = 0 \Leftrightarrow \rho = 0$.

To be specific, suppose $(X_i, Y_i)$, i = 1,2 are iid observations from a
$N_2(\theta_x, \theta_y, \sigma_x^2, \sigma_y^2, \rho)$ population. Then the joint distribution
of $U = \frac{X_1 - X_2}{\sqrt{2}\sigma_x}$ and $V = \frac{Y_1 - Y_2}{\sqrt{2}\sigma_y}$ is a
standard bivariate normal distribution with correlation coefficient $\rho$. Then using the
properties of the bivariate normal distribution, we get

$$\pi_c = P(UV > 0) = \frac{1}{2} + \frac{1}{\pi}\sin^{-1}\rho.$$

Then $\tau = \frac{2}{\pi}\sin^{-1}\rho$ and hence we get the desired relationship with
$\rho$. (In the last two formulas $\pi$ without a subscript denotes the usual constant.)


2.3 Estimating Kendall's τ


Suppose $(X_i, Y_i)$, i = 1,2,..,n are iid observations from a bivariate continuous
distribution F. If we define sgn(u) as

$$sgn(u) = +1 \text{ if } u > 0, \qquad = 0 \text{ if } u = 0, \qquad = -1 \text{ if } u < 0,$$

then we get

$$sgn(X_1 - X_2)\,sgn(Y_1 - Y_2) = +1 \text{ if } (X_1 - X_2)(Y_1 - Y_2) > 0, \qquad = -1 \text{ if } (X_1 - X_2)(Y_1 - Y_2) < 0$$

and consequently

$$E\{sgn(X_1 - X_2)\,sgn(Y_1 - Y_2)\} = P\{(X_1 - X_2)(Y_1 - Y_2) > 0\} - P\{(X_1 - X_2)(Y_1 - Y_2) < 0\}
  = \pi_c - \pi_d = \tau.$$

Motivated by the above, we define the symmetric kernel

$$\phi((X_1, Y_1), (X_2, Y_2)) = sgn(X_1 - X_2)\,sgn(Y_1 - Y_2)$$

with degree 2. Then the corresponding U statistic is

$$U = \frac{1}{\binom{n}{2}} \sum_{1 \le i < j \le n} sgn(X_i - X_j)\,sgn(Y_i - Y_j)
    = \frac{1}{\binom{n}{2}} \{\#\text{concordant pairs} - \#\text{discordant pairs}\}
    = \frac{1}{\binom{n}{2}} (C - D),$$

where C (D) is the number of concordant (discordant) pairs. U is defined as Kendall's sample
tau measure.
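
The sample measure can be computed by a direct count over all pairs. The following is a minimal sketch in Python; the function name `kendall_u` and the data are hypothetical.

```python
from itertools import combinations


def sgn(u: float) -> int:
    return (u > 0) - (u < 0)


def kendall_u(pairs: list) -> tuple:
    """Return (U, C, D): sample tau and the concordant / discordant pair counts."""
    n = len(pairs)
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(pairs, 2):
        s = sgn(x1 - x2) * sgn(y1 - y2)
        concordant += s > 0
        discordant += s < 0
    return (concordant - discordant) / (n * (n - 1) / 2), concordant, discordant


data = [(1, 1), (2, 3), (3, 2), (4, 4)]
print(kendall_u(data))
```

For these four points only the pair (2, 3), (3, 2) is discordant, so C = 5, D = 1 and U = 4/6.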


2.4 τ as correlation coefficient


Although the notions of Kendall's $\tau$ and the correlation coefficient $\rho$ are
contrasting, interestingly the sample measure of the former can be expressed as a product
moment correlation coefficient. Define $C_{ij} = sgn(X_i - X_j)$ and
$D_{ij} = sgn(Y_i - Y_j)$. Assume that $X_i \ne X_j$ and $Y_i \ne Y_j$ for every $i \ne j$; then
$\sum_{i,j=1}^{n} C_{ij}^2 = \sum_{i,j=1}^{n} D_{ij}^2 = n(n - 1)$. Again
$\sum_{i,j=1}^{n} C_{ij}D_{ij} = 2(C - D)$. Thus we can write

$$U = \frac{\sum_{i,j=1}^{n} C_{ij}D_{ij}}{\sqrt{\sum_{i,j=1}^{n} C_{ij}^2 \sum_{i,j=1}^{n} D_{ij}^2}}$$

and hence U is expressed as a sample correlation coefficient.


2.5 Summary Measures of U 


From the properties of U statistic, $E(U) = \tau$ and

$$Var(U) = \frac{2(n - 2)\sigma_1^2 + \sigma_2^2}{\binom{n}{2}}.$$

Since

$$\phi_1((X_1, Y_1)) = E\{\phi((X_1, Y_1), (X_2, Y_2)) \mid (X_1, Y_1)\},$$

we get

$$E\{sgn(X_1 - X_2)\,sgn(Y_1 - Y_2) \mid (X_1, Y_1)\} = 1 - 2F(X_1, \infty) - 2F(\infty, Y_1) + 4F(X_1, Y_1).$$

Then a routine algebraic manipulation gives

$$\sigma_1^2 = Var[1 - 2F(X_1, \infty) - 2F(\infty, Y_1) + 4F(X_1, Y_1)].$$

However,

$$\sigma_2^2 = Var\{sgn(X_1 - X_2)\,sgn(Y_1 - Y_2)\} = 1 - \tau^2.$$

Hence for specific choices of F, the variance can be obtained.

However, if we assume independence, that is $F(x, y) = F(x, \infty)F(\infty, y)\ \forall (x, y)$,
then

$$1 - 2F(X_1, \infty) - 2F(\infty, Y_1) + 4F(X_1, Y_1) \stackrel{d}{=} 4\left(V - \frac{1}{2}\right)\left(W - \frac{1}{2}\right),$$

where V and W are iid R(0,1) variables. Thus we get $\sigma_1^2 = \frac{1}{9}$. Moreover, under
independence $\sigma_2^2 = 1$ and hence

$$Var(U|\text{independence}) = \frac{4n + 10}{9n(n - 1)}.$$


2.6 U is distribution free 


Assume independence, that is $F(x, y) = F(x, \infty)F(\infty, y)\ \forall (x, y)$. Then if F
denotes the common univariate DF, then

$$U = \frac{1}{\binom{n}{2}} \sum\sum_{i<j} \{sgn(X_i - X_j)\,sgn(Y_i - Y_j)\}
    = \frac{1}{\binom{n}{2}} \sum\sum_{i<j} \{sgn(F(X_i) - F(X_j))\,sgn(F(Y_i) - F(Y_j))\}.$$

Thus U is expressed in terms of independent R(0,1) variables and hence, under independence,
U has a distribution independent of F.


2.7 Symmetry of U 


Under independence

$$F(x, y) = F(x, \infty)F(\infty, y)\ \forall (x, y),$$

we get that for every (i, j),

$$(X_i - X_j)(Y_i - Y_j) \stackrel{d}{=} (X_j - X_i)(Y_i - Y_j).$$

Thus $\phi((X_1, Y_1), (X_2, Y_2))$ has the same distribution as
$-\phi((X_1, Y_1), (X_2, Y_2))$ and hence the distributions of C and D are identical. Thus

$$C - D \stackrel{d}{=} D - C$$

and hence

$$U \stackrel{d}{=} -U,$$

that is, under independence, the distribution of U is symmetric about 0.


2.8 <A test of independence 


Suppose we are interested in testing independence, Ho : F(x,y) = F(x,oo)F(oo, y)V(x, y) 
against dependence that is Ha : F(x,y) Æ F(r,oo)F(oco, y) for some (x,y). For underlying 
dependence, U tends to be either too large or too small. Thus too large and too small values 
of U indicate the strong possibility of dependence. Again the distribution of U is symmetric 
about 0 under the null hypothesis. Thus we reject the null hypothesis, if |T'| is too large. 


Again from the theory of U statistic, as n — oo, 


4 
va ^ N(0, 5). 


Then a large sample test rejects Ho if 3^ |U| ay: 
From the exact distribution of $U$, one can also calculate the p values to judge the strength of the observed data. Suppose $u$ is the observed value of $U$. Then the one sided p values are $P_{H_0}(U \ge u)$ and $P_{H_0}(U \le u)$, corresponding to the underlying alternatives $H_a : F(x,y) > F(x,\infty)F(\infty,y)$ for some $(x,y)$ and $H_a : F(x,y) < F(x,\infty)F(\infty,y)$ for some $(x,y)$, respectively. However, for the two sided alternative $H_a : F(x,y) \ne F(x,\infty)F(\infty,y)$ for some $(x,y)$, the p value is defined as $2\min(P_{H_0}(U \ge u), P_{H_0}(U \le u))$. In every case, we reject the null hypothesis if the p value does not exceed $\alpha$.
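The test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the data are assumed to be free of ties, and the exact p value is obtained by brute-force enumeration of all $n!$ orderings of the $Y$ values, which is feasible only for small $n$.

```python
from itertools import combinations, permutations
from math import erf, sqrt


def sgn(v):
    """Sign of v: -1, 0 or 1."""
    return (v > 0) - (v < 0)


def kendall_u(x, y):
    """U = average of sgn(x_i - x_j) * sgn(y_i - y_j) over all pairs i < j."""
    pairs = list(combinations(range(len(x)), 2))
    return sum(sgn(x[i] - x[j]) * sgn(y[i] - y[j]) for i, j in pairs) / len(pairs)


def large_sample_p_value(u, n):
    """Two sided p value using sqrt(n) * U ~ N(0, 4/9) under independence."""
    z = 1.5 * sqrt(n) * abs(u)
    normal_cdf = 0.5 * (1.0 + erf(z / sqrt(2.0)))
    return 2.0 * (1.0 - normal_cdf)


def exact_two_sided_p_value(x, y):
    """2 * min(P(U >= u), P(U <= u)) by enumerating all orderings of y (small n only)."""
    u = kendall_u(x, y)
    null_values = [kendall_u(x, perm) for perm in permutations(y)]
    upper = sum(v >= u for v in null_values) / len(null_values)
    lower = sum(v <= u for v in null_values) / len(null_values)
    return min(1.0, 2.0 * min(upper, lower))
```

For example, with x = [1, 2, 3, 4, 5] and y = [2, 1, 4, 3, 5] there are 8 concordant and 2 discordant pairs out of 10, so kendall_u(x, y) returns 0.6.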


2.9 Presence of ties 


We have assumed continuity of each marginal DF to avoid ties. But in practice, ties can occur within either or both samples. However, $U$ is based on the quantities $\mathrm{sgn}(X_i - X_j)\,\mathrm{sgn}(Y_i - Y_j)$ and hence assigns a value zero if a tie occurs. Thus as before $U$ remains unbiased for $\tau$ even under the presence of ties. Now if we have $t$ ties in the $X$ observations of lengths $k_1,..,k_t$ and $s$ ties in the $Y$ observations of lengths $k'_1,..,k'_s$, then $U$ can be modified as

$$U_b = \frac{2(C - D)}{\sqrt{n(n-1) - \sum_{i=1}^{t} k_i(k_i - 1)}\ \sqrt{n(n-1) - \sum_{j=1}^{s} k'_j(k'_j - 1)}}.$$

The resulting modification is often called Kendall's $\tau_b$ measure.


3 Association in discrete distributions 


From the development so far, we observe that Kendall's $\tau$ is defined only when $p_c + p_d = 1$. In particular if the marginal distributions are not continuous, we have

$$p_c + p_d = P\{(X_i - X_j)(Y_i - Y_j) > 0\} + P\{(X_i - X_j)(Y_i - Y_j) < 0\} = 1 - P\{(Y_i - Y_j)(X_i - X_j) = 0\} = 1 - p_0.$$

Naturally $p_0$ is the probability that a pair is neither concordant nor discordant. Thus $\tau$ is not a sensitive measure of association if $p_0 > 0$.

Thus we define the conditional probability of concordance as

$$p_c^* = \frac{P\{(Y_i - Y_j)(X_i - X_j) > 0\}}{P\{(Y_i - Y_j)(X_i - X_j) > 0\} + P\{(Y_i - Y_j)(X_i - X_j) < 0\}}.$$

Naturally $p_c^* = \frac{p_c}{1 - p_0}$. Similarly, the conditional probability of discordance is $p_d^* = 1 - p_c^*$.

Then a modified association measure can be defined as $\tau^* = p_c^* - p_d^*$. Since $\tau^* = \frac{\tau}{1 - p_0}$, we get the corresponding sample version $U^* = \frac{U}{1 - \hat{p}_0}$, where $\hat{p}_0$ is a consistent estimate of $p_0$. Since the large sample distributions of $U^*$ and $U/(1 - p_0)$ are the same, the asymptotic distribution and a large sample test based on $U^*$ can be derived as earlier.
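A minimal sketch of the sample version $U^*$ follows. It assumes that $\hat{p}_0$ is taken as the observed proportion of pairs tied in $X$ or in $Y$; this particular estimate and the function name are illustrative choices, not prescribed by the text, and the sketch requires at least one untied pair.

```python
from itertools import combinations


def tie_adjusted_u(x, y):
    """Return (U, p0_hat, U*) with U* = U / (1 - p0_hat).

    p0_hat is the observed proportion of pairs that are neither
    concordant nor discordant (an assumed estimate of p_0).
    """
    concordant = discordant = tied = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
        else:
            tied += 1
    n_pairs = concordant + discordant + tied
    u = (concordant - discordant) / n_pairs
    p0_hat = tied / n_pairs
    return u, p0_hat, u / (1.0 - p0_hat)
```

With this choice of $\hat{p}_0$, $U^*$ equals (concordant - discordant) / (concordant + discordant), so only untied pairs contribute.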


4 Application of Kendall's $\tau$: Test of Trend


For data coming from some measurement processes, often some trend is observed, i.e. observations, in general, depend on time. In the test of randomness, we have seen that a specific pattern of runs indicates the presence of trend. Now we shall discuss how Kendall's $\tau$ can be used to check the presence of trend.

Suppose $X$ indicates the time variable and $Y$ denotes the corresponding observation. Then an association between $X$ and $Y$ might indicate the presence of trend, and this suggests using Kendall's $\tau$ to serve the purpose. Suppose $Y_i, i = 1,2,..,n$ are the time ordered observations; then the hypothesis of independence of the time ordered observations can be tested through $\tau$. In such a case, we take $X_i = i, i = 1,2,..,n$ and define the statistic as

$$U = \frac{1}{\binom{n}{2}} \sum_{1 \le i < j \le n} \mathrm{sgn}(j - i)\,\mathrm{sgn}(Y_j - Y_i) = \frac{1}{\binom{n}{2}} \sum_{1 \le i < j \le n} \mathrm{sgn}(Y_j - Y_i).$$

$U$ is expected to be higher or lower according as the underlying trend is upward or downward. Thus we reject the null hypothesis of the absence of any trend against the presence of an upward (downward) trend if $U$ is too large (small). Note that the null distribution of $U$ is the same as before and hence tests can be constructed.
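The trend statistic is simple to compute directly from the time ordered series. The following sketch assumes the observations are supplied in time order; the function name is hypothetical.

```python
from itertools import combinations


def trend_u(y):
    """U = average of sgn(y_j - y_i) over all pairs i < j of a time ordered series."""
    n = len(y)
    total = sum((y[j] > y[i]) - (y[j] < y[i]) for i, j in combinations(range(n), 2))
    return total / (n * (n - 1) / 2)
```

A strictly increasing series gives U = 1 and a strictly decreasing series gives U = -1; values near 0 are consistent with the absence of trend.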




Nonparametric Inference: Module 17


What we provide in this module 


• Ranks & Midranks
• Properties of ranks
• Linear rank statistics
• Properties


1 Co-Ordinator: Dr. Rahul Bhattacharya, Department of Statistics, Calcutta University


1 Ranks & their properties 


1.1 Definition of ranks 


For $N$ iid observations $Z_i, i = 1,2,..,N$ from a continuous population, the rank $R_i$ of the $i$ th observation is defined as the number of observations that are less than or equal to $Z_i$. That is,

$$R_i = \sum_{j=1}^{N} I(Z_j \le Z_i) = \#\ \text{observations less than or equal to } Z_i.$$

Hence, for the ordered observations $Z_{(1)} < Z_{(2)} < .. < Z_{(N)}$, the rank of $Z_{(i)}$ is simply $i$. Now observe that

$$R_i = \sum_{j=1}^{N} I(Z_j \le Z_i) = N F_N(Z_i),$$

where $F_N$ is the ecdf. Thus we have the representation

$$R_i = 1 + \sum_{j (\ne i) = 1}^{N} I(Z_j < Z_i) = 1 + T(Z_i),$$

where the conditional distribution of $T(Z_i)$ is Binomial$(N - 1, F(Z_i))$ for fixed $Z_i$.
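The three equivalent descriptions of a rank (count of observations not exceeding $Z_i$, $N$ times the ecdf, and one plus the count of strictly smaller observations) can be checked numerically. The sketch below assumes distinct observations; the function names are illustrative.

```python
def ranks(z):
    """R_i = number of observations less than or equal to z_i (distinct values assumed)."""
    return [sum(zj <= zi for zj in z) for zi in z]


def ecdf(z, t):
    """Empirical distribution function F_N(t) of the sample z."""
    return sum(zj <= t for zj in z) / len(z)


def ranks_via_strictly_smaller(z):
    """R_i = 1 + number of observations strictly smaller than z_i."""
    return [1 + sum(zj < zi for zj in z) for zi in z]
```

For the sample (2.3, 0.5, 4.1, 1.7) all three routes give the rank vector (3, 1, 4, 2).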


1.2 Joint distribution of Ranks 


Suppose $Z = (Z_1, Z_2, .., Z_N)$ are iid observations from an absolutely continuous distribution $F$, $Y$ is the corresponding full set of order statistics and $P_N$ is the set of $N!$ permutations of the first $N$ natural numbers. Then the joint density of $Z$ given the full set of order statistics is

$$f(z \mid y) = \frac{1}{N!}$$

for each of the $N!$ permutations $z$ of $y$. Since $Z$ is fully determined from the knowledge of $Y$ and the vector of ranks $R = (R_1, R_2, .., R_N)$, and conversely, we get

$$P(R = r \mid y) = \frac{1}{N!}, \quad r \in P_N.$$

Hence $P(R = r) = \frac{1}{N!}, r \in P_N$. Thus the joint distribution of $R$ is independent of any $F$.

However, one can arrive at the same joint distribution considering the fact that for distinct observations

$$(R_i = r_i\ \forall i = 1,2,..,N) \iff (Z_{\tau_1} < Z_{\tau_2} < .. < Z_{\tau_N})$$

for some permutation $(\tau_1, \tau_2, .., \tau_N) \in P_N$. As a simple consequence of continuity, all $N!$ orderings are equally likely, and we get as earlier

$$P(R = r) = \frac{1}{N!}, \quad r \in P_N.$$


1.3 Marginal distributions of Ranks 


Using the definition of marginal distributions, one can obtain the marginal distribution of $R_i$ as

$$P(R_i = r_i) = \sum_{r_j,\ j \ne i} P(R = r) = \frac{(N-1)!}{N!} = \frac{1}{N}.$$

Hence the marginal distribution of each rank variable is discrete uniform over $\{1,2,..,N\}$. Thus the following are immediate:

$$E(R_i) = \frac{N+1}{2} \quad \text{and} \quad \mathrm{Var}(R_i) = \frac{N^2 - 1}{12}.$$

Now by definition, the marginal joint distribution of $R_i$ and $R_j$ is

$$P(R_i = r_i, R_j = r_j) = \sum_{r_l,\ l \ne i,j} P(R = r) = \frac{(N-2)!}{N!} = \frac{1}{N(N-1)}, \quad r_i \ne r_j.$$

As a simple consequence, we have $\mathrm{Cov}(R_i, R_j) = -\frac{N+1}{12}$, and hence the correlation coefficient between $R_i$ and $R_j$ comes out as $-\frac{1}{N-1}$. Since $\sum_{i=1}^{N} R_i = N(N+1)/2$ is fixed, $R_i$ and $R_j$ are inversely related and hence the sign of the correlation coefficient is negative. However, for large $N$, the elements of the rank vector become nearly uncorrelated.
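The moments of the ranks can be verified by brute force for small $N$, since under continuity every permutation of $(1,..,N)$ has probability $1/N!$. The sketch below uses exact rational arithmetic; the function name is hypothetical and the enumeration is practical only for small $N$.

```python
from fractions import Fraction
from itertools import permutations


def rank_moments(N):
    """Exact E(R_1), Var(R_1) and Cov(R_1, R_2) under the uniform law on permutations."""
    perms = list(permutations(range(1, N + 1)))
    w = Fraction(1, len(perms))  # each rank vector has probability 1/N!
    mean = sum(w * r[0] for r in perms)
    var = sum(w * (r[0] - mean) ** 2 for r in perms)
    cov = sum(w * (r[0] - mean) * (r[1] - mean) for r in perms)
    return mean, var, cov
```

For N = 4 this returns mean 5/2, variance 5/4 and covariance -5/12, in agreement with (N+1)/2, (N^2-1)/12 and -(N+1)/12, so the correlation is -1/3 = -1/(N-1).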


1.4 Midranks: Ranks under tie 


However, the distribution of ranks is not so simple when the underlying distribution is dis- 
crete. Actually discrete parent allows the possibility of tied observations and hence the usual 
definition becomes unsuitable. Thus if the underlying distribution is continuous, the two ob- 
servations being identical has probability zero but for discrete distributions observations can 
be identical with positive probability. In practice ties can occur due to the limitation of the 
measurement process and in such a case we introduce the concept of midrank. Midrank is 
the average rank assigned to each member of a set of tied observations. 

We start with an example for better understanding. Suppose the observations are (2,2,7,1,5). 
Then the sorted observations are (1,2,2,5,7). Then the second and third observations are 
tied at 2. Thus the midrank assigned to these observations(tied at 2) is (2+3)/2=2.5. Then 
we get the modified rank vector corresponding to the sorted observations as (1,2.5,2.5,4,5). 
Thus the midrank of an observation is the number of observations lower than it plus the average rank due to tie. For a set of observations $Z_i, i = 1,2,..,N$, the midrank of the $k$ th observation is defined as

$$R_k^* = \text{Midrank}(Z_k) = \sum_{j=1}^{N} I(Z_j < Z_k) + \frac{1}{2}\left\{1 + \sum_{j=1}^{N} I(Z_j = Z_k)\right\}.$$

Naturally, if the observations are distinct, we get $R_k^* = R_k$ for any $k = 1,2,..,N$.
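The midrank definition translates directly into code. The sketch below counts, for each observation, the strictly smaller values and the values tied with it (including itself); the function name is illustrative.

```python
def midranks(z):
    """Midrank of each observation: (# strictly smaller) + (1 + # equal) / 2."""
    result = []
    for zk in z:
        smaller = sum(zj < zk for zj in z)
        equal = sum(zj == zk for zj in z)  # includes zk itself
        result.append(smaller + (equal + 1) / 2)
    return result
```

Applied to the observations (2, 2, 7, 1, 5) from the example above, it gives (2.5, 2.5, 5, 1, 4), which corresponds to (1, 2.5, 2.5, 4, 5) for the sorted observations.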

For the set of ranks, it was observed that the joint distribution is independent of the parent population. However, in case of midranks such a result does not hold. For example, consider N=2 and assume that $Z_i$ are iid Bernoulli($p$) with $P(Z_i = 1) = p$. Then the possibilities together with the joint distribution are given below.

Table 1: Distribution of midrank for N=2

(z1, z2) | (r1, r2)   | P(R1 = r1, R2 = r2)
(0,1)    | (1,2)      | p(1-p)
(1,0)    | (2,1)      | p(1-p)
(1,1)    | (1.5,1.5)  | p^2
(0,0)    | (1.5,1.5)  | (1-p)^2


Unlike the ranks in the untied case, here the rank takes three possible values, namely, 1, 1.5 and 2, with unequal probabilities. Again it is easy to observe that the marginal distributions are also different from uniform for any $p \in (0,1)$. As another example, consider three iid Bernoulli($p$) variables $Z_i$. As before, we list the possibilities but obtain only the marginal distribution of the midrank of $Z_1$; adding the probabilities of the rows sharing the same $r_1$ gives $P(R_1 = r_1)$.

Table 2: Distribution of midrank for N=3

(z1, z2, z3) | r1  | probability of (z1, z2, z3)
(0,0,0)      | 2   | (1-p)^3
(0,0,1)      | 1.5 | p(1-p)^2
(0,1,0)      | 1.5 | p(1-p)^2
(0,1,1)      | 1   | (1-p)p^2
(1,0,0)      | 3   | p(1-p)^2
(1,0,1)      | 2.5 | p^2(1-p)
(1,1,0)      | 2.5 | p^2(1-p)
(1,1,1)      | 2   | p^3


Naturally, the marginal distribution is different from uniform. 
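The dependence of the midrank distribution on $p$ can be made concrete by enumerating all $2^N$ Bernoulli outcomes. The sketch below is illustrative only: it assumes $P(Z_i = 1) = p$, tabulates the marginal distribution of the midrank of the first observation, and uses hypothetical function names.

```python
from fractions import Fraction
from itertools import product


def midrank_of_first(z):
    """Midrank of z[0] within the sample z."""
    smaller = sum(zj < z[0] for zj in z)
    equal = sum(zj == z[0] for zj in z)
    return smaller + Fraction(equal + 1, 2)


def marginal_midrank_pmf(N, p):
    """Exact pmf of the midrank of Z_1 for N iid Bernoulli(p) observations."""
    pmf = {}
    for z in product((0, 1), repeat=N):
        prob = Fraction(1)
        for zi in z:
            prob *= p if zi == 1 else 1 - p
        r = midrank_of_first(z)
        pmf[r] = pmf.get(r, 0) + prob
    return pmf
```

For N = 2 the midrank of the first observation takes the values 1, 1.5 and 2 with probabilities p(1-p), p^2 + (1-p)^2 and p(1-p), which clearly change with p.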


2 Linear rank statistics 


Thus we find that the distribution of ranks is independent of the underlying population as 
long as the continuity holds. Consequently, ranks form a basis for developing distribution 
free procedures to identify the possible differences between populations. For example, if one 
population has a larger location measure, then one can expect higher ranks corresponding 
to this population in the combined sample observations. Most commonly used tests can 
be expressed in terms of linear combinations of the function of ranks and indicator of the 
populations. Such linear combinations are termed linear rank statistics and they can be 
used for a number of hypothesis testing problems. However, replacing the observations by 
the corresponding ranks often incurs a loss of information. But it can be shown that the 
observation and the corresponding rank are highly correlated and hence such loss is not 
highly consequential. 

Suppose $Z_i, i = 1,2,..,N$ are iid observations from an unknown but continuous distribution $F$ and $R_i, i = 1,2,..,N$ are the corresponding ranks. It follows from the continuity of $F$ that the ranks are distinct with probability one. Then a linear rank statistic is defined by $L = \sum_{i=1}^{N} c_i\, a(R_i)$, where $c_i, i = 1,2,..,N$ are regression constants and $a(i), i = 1,2,..,N$ are scores. We have indicated earlier that the joint distribution of ranks is independent of any $F$. Being a function of ranks, the distribution of $L$ is, therefore, independent of any $F$. Hence $L$ has the ability to provide distribution free tests for different alternatives regarding $F$.


2.1 Properties of LRS 


Now we shall discuss different exact and asymptotic properties of LRS. 


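Result (Summary measures): For a linear rank statistic $L$,

$$E(L) = N \bar{c}\, \bar{a}$$

and

$$\mathrm{Var}(L) = \frac{N^2}{N-1}\, \sigma_c^2\, \sigma_a^2,$$

where $\bar{c} = \frac{1}{N}\sum_{i=1}^{N} c_i$, $\bar{a} = \frac{1}{N}\sum_{i=1}^{N} a(i)$, $\sigma_a^2 = \frac{1}{N}\sum_{i=1}^{N} (a(i) - \bar{a})^2$ and $\sigma_c^2 = \frac{1}{N}\sum_{i=1}^{N} (c_i - \bar{c})^2$.

Proof: By definition,

$$E(L) = \sum_{i=1}^{N} c_i\, E(a(R_i)) = \sum_{i=1}^{N} c_i\, \frac{1}{N} \sum_{j=1}^{N} a(j) = N \bar{c}\, \bar{a}.$$

Now we note that since the ranks are correlated, different terms of $L$ are dependent. Hence

$$\mathrm{Var}(L) = \sum_{i=1}^{N} c_i^2\, \mathrm{Var}(a(R_i)) + \sum_{i \ne j} c_i c_j\, \mathrm{Cov}(a(R_i), a(R_j)).$$

Now it is easy to obtain that $\mathrm{Var}(a(R_i)) = \sigma_a^2$ and $\mathrm{Cov}(a(R_i), a(R_j)) = -\frac{\sigma_a^2}{N-1}$. Combining all these the result follows.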
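The summary measures can be checked against a direct enumeration of all $N!$ equally likely rank vectors. The sketch below assumes integer regression constants and scores so that exact rational arithmetic applies; the function names are hypothetical and the enumeration is practical only for small $N$.

```python
from fractions import Fraction
from itertools import permutations


def lrs_moments_exact(c, a):
    """Mean and variance of L = sum c_i a(R_i) by enumerating all rank vectors."""
    N = len(c)
    values = [sum(ci * a[r - 1] for ci, r in zip(c, perm))
              for perm in permutations(range(1, N + 1))]
    mean = Fraction(sum(values), len(values))
    var = Fraction(sum(v * v for v in values), len(values)) - mean ** 2
    return mean, var


def lrs_moments_formula(c, a):
    """Mean N*c_bar*a_bar and variance N^2/(N-1) * var_c * var_a."""
    N = len(c)
    c_bar = Fraction(sum(c), N)
    a_bar = Fraction(sum(a), N)
    var_c = sum((ci - c_bar) ** 2 for ci in c) / N
    var_a = sum((ai - a_bar) ** 2 for ai in a) / N
    return N * c_bar * a_bar, Fraction(N * N, N - 1) * var_c * var_a
```

With c = (0, 0, 1, 1, 1) and a = (1, 2, 3, 4, 5) both functions return mean 9 and variance 3.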


Result (Symmetry): For a linear rank statistic $L$, if either $c_i + c_{N-i+1}$ or $a(i) + a(N-i+1)$ is a constant for all $i$, the distribution of $L$ is symmetric about its expectation.

Proof: Observe that $(R_1, .., R_N) \overset{d}{=} (N - R_1 + 1, N - R_2 + 1, .., N - R_N + 1)$ and assume that $a(i) + a(N-i+1) = k\ \forall i$. Then $\bar{a} = k/2$ and hence

$$L = \sum_{i=1}^{N} c_i\, a(R_i) \overset{d}{=} \sum_{i=1}^{N} c_i\, a(N - R_i + 1) = \sum_{i=1}^{N} c_i \{k - a(R_i)\} = N\bar{c}\,k - L.$$

Thus using the fact that $k = 2\bar{a}$, we get

$$L - N\bar{c}\,\bar{a} \overset{d}{=} N\bar{c}\,\bar{a} - L.$$

Hence the symmetry follows. A similar result can be proved using the condition on regression constants.


The next result is on asymptotic normality of the linear rank statistic. However, before going 


to details, we need to define Wald-Wolfowitz, Noether and Hoeffding conditions. 




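Wald-Wolfowitz Condition (W): A sequence $\{b_n(i), i = 1,2,..,n\}$ with mean $\bar{b}_n = \frac{1}{n}\sum_{i=1}^{n} b_n(i)$ is said to satisfy the Wald-Wolfowitz Condition if

$$\frac{\frac{1}{n}\sum_{i=1}^{n} (b_n(i) - \bar{b}_n)^r}{\left\{\frac{1}{n}\sum_{i=1}^{n} (b_n(i) - \bar{b}_n)^2\right\}^{r/2}} = O(1)$$

as $n \to \infty$ for $r \ge 3$.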


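Noether Condition (N): A sequence $\{b_n(i)\}$ is said to satisfy the Noether Condition if

$$\frac{\sum_{i=1}^{n} (b_n(i) - \bar{b}_n)^r}{\left\{\sum_{i=1}^{n} (b_n(i) - \bar{b}_n)^2\right\}^{r/2}} = o(1)$$

as $n \to \infty$ for $r \ge 3$.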


Hoeffding Condition (H): A sequence $\{b_n(i)\}$ is said to satisfy the Hoeffding Condition if

$$\frac{\max_{1 \le i \le n} (b_n(i) - \bar{b}_n)^2}{\sum_{j=1}^{n} (b_n(j) - \bar{b}_n)^2} = o(1)$$

as $n \to \infty$.


It can be shown that condition W implies condition N, while conditions H and N are equivalent. Now we state the asymptotic normality of the linear rank statistic in the following result.

Result (Asymptotic normality): If

1. $(R_1, .., R_N)$ has a uniform distribution over $P_N$,

2. $(c_1, c_2, .., c_N)$ satisfies condition W, and

3. $(a(1), .., a(N))$ satisfies condition N,

then as $N \to \infty$,

$$\frac{L - E(L)}{\sqrt{\mathrm{Var}(L)}} \xrightarrow{d} N(0,1).$$

However, we skip the proof but provide some examples of regression constants and score functions satisfying the above requirements.


Suppose $c_i = 0$ or $1$ according as $i \le n$ or $i \ge n+1$, where $i = 1,2,..,N$ with $N = m+n$. Also $m = m(N)$ and $n = n(N)$ are such that $\frac{m}{N} \to \lambda$ and $\frac{n}{N} \to 1-\lambda$ as $N \to \infty$, where $\lambda \in (0,1)$. Then it is easy to observe that $(c_1, c_2, .., c_N)$ satisfies W.

On the other hand, if we choose $a(i) = 0$ or $1$ according as $i \le [\frac{N+1}{2}]$ or $i \ge [\frac{N+1}{2}] + 1$ with $i = 1,2,..,N$, then it is easy to obtain that $N^{-1} \max_i (a(i) - \bar{a})^2 = O(N^{-1})$, as $N \to \infty$. Thus condition H and consequently condition N is satisfied by $\{a(i), i \ge 1\}$.


Nonparametric Inference: Module 18

What we provide in this module

• Applying LRS in two sample location problems
• Applying LRS in two sample scale problems
• Multiple sample location problems


1 Co-Ordinator: Dr. Rahul Bhattacharya, Department of Statistics, Calcutta University 


1 Two sample problems & LRS 


We have already seen that the distribution of LRS is independent of any F under continuity 
and hence we can use LRS to provide distribution free tests. In addition LRS is asymp- 
totically normal under fairly general conditions and hence large sample tests can also be 


constructed for various problems. However, we start with two sample location problems. 


1.1 Two sample location problem 


Suppose

$$X_1, X_2, .., X_n \overset{iid}{\sim} F$$

independently of

$$Y_1, Y_2, .., Y_m \overset{iid}{\sim} G,$$

where $F$ and $G$ are unknown but known to be continuous. For a location problem, we assume $G(x) = F(x - \theta)$ for some $-\infty < \theta < \infty$. Then our objective is to test $H_0 : \theta = 0$ against one of the alternatives $H_a : \theta > 0$ or $H_a : \theta < 0$ or $H_a : \theta \ne 0$. Suppose $Z = (X_1, X_2, .., X_n, Y_1, .., Y_m)$ is the combined set with $N = m + n$ observations. Also let $R_i$ be the rank of the $i$ th observation in the combined set of observations, $i = 1,2,..,N$. Now we take the population indicators

$$c_i = \begin{cases} 0 & \text{if } i \le n \\ 1 & \text{if } i \ge n+1 \end{cases}$$

as regression constants. Then LRS reduces to

$$L = \sum_{i=n+1}^{N} a(R_i),$$


which is simply the sum of scores corresponding to the second sample observations. Thus 
if we can choose an appropriate score, we can develop the corresponding test for the above 


hypothesis. However, for an appropriate choice of the score function a(.), we need to know 


the behaviour of ranks under a location alternative. 

Assume $\theta > 0$; then second sample observations are expected to be larger than the first sample observations. Naturally ranks corresponding to the second sample observations are expected to be larger under $\theta > 0$ than under $\theta = 0$. Thus if we take an increasing score function, then $a(R_i)$ are expected to be larger under $\theta > 0$ than under $\theta = 0$ for every $i \ge n+1$. Hence for such a choice, $L$ is expected to be larger under $\theta > 0$ than under $\theta = 0$. Again for $\theta < 0$, second sample observations are expected to be smaller than the first sample observations. Naturally ranks corresponding to the second sample observations are expected to be smaller under $\theta < 0$ than under $\theta = 0$. Thus for an increasing score function, $a(R_i)$ are expected to be smaller under $\theta < 0$ than under $\theta = 0$ for every $i \ge n+1$. Hence $L$ is expected to be smaller under $\theta < 0$ than under $\theta = 0$ for such a choice of score.

However for $\theta \ne 0$, second sample observations are expected to be either smaller or larger than the first sample observations. Naturally, in such a situation, ranks corresponding to the second sample observations are expected to be either smaller or larger under $\theta \ne 0$ than under $\theta = 0$. Thus for an increasing score function, $a(R_i)$ are expected to be either smaller or larger under $\theta \ne 0$ than under $\theta = 0$ for every $i \ge n+1$. Hence $L$ is expected to be either smaller or larger under $\theta \ne 0$ than under $\theta = 0$.

Thus for an increasing score function, we get a right tailed test based on $L$ for the alternative $H_a : \theta > 0$ and a left tailed test based on $L$ for $H_a : \theta < 0$. For the two sided hypothesis $H_a : \theta \ne 0$, a two tailed test based on $L$ would be appropriate. However, for the selected choice of $c_i$, the sum $c_i + c_{N-i+1}$ is constant in $i$ only when $m = n$, so symmetry of the null distribution of $L$ cannot in general be deduced from the regression constants. But depending on the choice of the score function, the distribution of $L$ can be symmetric about $E_{H_0}(L)$ and hence in such a case the two sided test reduces to a right tailed test based on $|L - E_{H_0}(L)|$. Since the distribution of $L$ is discrete under the null hypothesis, exact size $\alpha$ tests will be, in general, randomized.


1.1.1 Wilcoxon score & test 


Consider Wilcoxon score, that is, $a(i) = i$; then $L$ reduces to

$$L = \sum_{i=n+1}^{N} R_i.$$

Thus $L$ is simply the sum of the ranks corresponding to the second sample. $L$ is well known as the Wilcoxon rank sum statistic. It is already derived that $\bar{c} = m/N$ and $\sum_{i=1}^{N} (c_i - \bar{c})^2 = mn/N$. Again for Wilcoxon score $\bar{a} = (N+1)/2$ and $\sigma_a^2 = (N^2 - 1)/12$. Hence it follows from the theory of LRS that under the null hypothesis, $E_{H_0}(L) = m(N+1)/2$ and $\mathrm{Var}_{H_0}(L) = mn(N+1)/12$.

Again the distribution of $L$ is symmetric under the null hypothesis. First of all we note that for the assumed choice of $c_i$, the sum $c_i + c_{N-i+1}$ takes values in $\{0, 1, 2\}$ and is not constant in $i$ unless $m = n$, and hence we need to check the other condition. However, for the assumed choice of $a(i)$, we find that $a(i) + a(N-i+1) = N+1\ \forall i$ and hence the symmetry of $L$ under the null hypothesis follows from the results of LRS.

It is further interesting to note that the Wilcoxon statistic $L$ and the Mann Whitney statistic are linearly related. For a proof, first of all note that

Rank of $Y_{(j)}$ in the combined sample = Number of $X$'s less than $Y_{(j)}$ + Rank of $Y_{(j)}$ among the $Y$ observations ........ (*)

Now the rank of $Y_{(j)}$ among the $Y$ observations is simply $j$, and when summing over $j$, the first quantity in the RHS of (*) above gives the Mann Whitney statistic $U$. Again if we sum the LHS of (*) over $j$, we get the Wilcoxon rank sum statistic $L$. Thus we get the relation

$$L = U + m(m+1)/2.$$


Due to the above relationship, the tests based on Wilcoxon statistic and Mann Whitney 
statistic are equivalent and hence they share similar properties. It is already derived that 


tests based on Mann Whitney statistic are consistent. Then due to equivalence, tests based 


on Wilcoxon rank sum statistic are also consistent. Further, the asymptotic normality of the Wilcoxon statistic also follows from that of the Mann Whitney statistic.

It is easy to observe that the Wilcoxon score vector satisfies the Noether condition and hence

$$L^* = \frac{L - E_{H_0}(L)}{\sqrt{\mathrm{Var}_{H_0}(L)}}$$

is asymptotically standard normal. Thus we can perform large sample tests based on $L^*$. In particular, a large sample test for $H_0 : \theta = 0$ against $H_a : \theta > 0$ rejects the null hypothesis if $L^* > \tau_\alpha$. Similarly large sample tests for the remaining alternatives can also be constructed.
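The linear relation between the Wilcoxon rank sum statistic and the Mann Whitney statistic, and the standardised statistic $L^*$, can be illustrated with a short sketch. It assumes no ties in the combined sample, takes x as the first sample (size n) and y as the second sample (size m), and uses hypothetical function names.

```python
from math import sqrt


def wilcoxon_rank_sum(x, y):
    """L = sum of the combined-sample ranks of the second sample y (no ties assumed)."""
    combined = list(x) + list(y)
    ranks = [sum(v <= z for v in combined) for z in combined]
    return sum(ranks[len(x):])


def mann_whitney_u(x, y):
    """U = number of pairs (x_i, y_j) with x_i < y_j."""
    return sum(xi < yj for xi in x for yj in y)


def standardized_rank_sum(x, y):
    """L* = (L - m(N+1)/2) / sqrt(mn(N+1)/12), with n = len(x), m = len(y)."""
    n, m = len(x), len(y)
    N = n + m
    mean = m * (N + 1) / 2
    var = m * n * (N + 1) / 12
    return (wilcoxon_rank_sum(x, y) - mean) / sqrt(var)
```

For x = (1.1, 3.4, 2.2) and y = (2.9, 5.0, 0.7, 4.1), the second sample ranks are 4, 7, 1, 6, so L = 18, while U = 8 and m(m+1)/2 = 10, in agreement with L = U + m(m+1)/2.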


1.1.2 Median score & test 


Next we choose $a(i) = I(i > [\frac{N+1}{2}])$, where $[x]$ denotes the integer part of $x$. Since $[\frac{N+1}{2}]$ is the median of the combined set $\{1,2,..,N\}$, we have score unity when $i$ exceeds the median and zero otherwise. This justifies the name median score. For median score, $L$ reduces to

$$L = \sum_{i=n+1}^{N} I\left(R_i > \left[\tfrac{N+1}{2}\right]\right),$$

which gives the number of the second sample ranks exceeding $[\frac{N+1}{2}]$. $L$ is well known as Mood's median test statistic. For median score $\bar{a} = 1 - [\frac{N+1}{2}]/N$ and $\sigma_a^2 = \frac{[\frac{N+1}{2}]}{N}\left(1 - \frac{[\frac{N+1}{2}]}{N}\right)$. Now from the first principle, it follows that

$$P_{H_0}(L = l) = \frac{\#\ \text{choices of } (R_{n+1},..,R_N) \text{ out of } (1,2,..,N) \text{ with } L = l}{\binom{N}{m}} = \frac{\binom{N - [\frac{N+1}{2}]}{l}\binom{[\frac{N+1}{2}]}{m-l}}{\binom{N}{m}}$$

for admissible choices of $l$. Thus the distribution is hypergeometric. Now it follows easily from the theory of LRS that under the null hypothesis, $E_{H_0}(L) = m(1 - [\frac{N+1}{2}]/N)$. In a similar way, using the properties of hypergeometric distribution, $\mathrm{Var}_{H_0}(L)$ can also be obtained. Now for the median score with $N$ even, we find that $a(i) + a(N-i+1) = 1\ \forall i$. Thus, in that case, the symmetry of $L$ under the null hypothesis follows from the results of LRS.

From the general principle, the exact tests can be constructed. However, the median score vector satisfies the Hoeffding condition and hence

$$L^* = \frac{L - E_{H_0}(L)}{\sqrt{\mathrm{Var}_{H_0}(L)}}$$

is asymptotically standard normal. Thus we can construct different large sample tests based on $L^*$. In particular, a large sample test for $H_0 : \theta = 0$ against $H_a : \theta \ne 0$ rejects the null hypothesis if $|L^*| > \tau_{\alpha/2}$. Similarly large sample tests for the remaining alternatives can also be constructed.
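The hypergeometric null distribution of the median test statistic can be tabulated exactly. The sketch below assumes the convention used above (first sample of size n, second sample of size m, score 1 for ranks exceeding $[(N+1)/2]$); the function name is hypothetical.

```python
from fractions import Fraction
from math import comb


def median_test_null_pmf(n, m):
    """Exact null pmf of L = number of second sample ranks exceeding [(N+1)/2]."""
    N = n + m
    cut = (N + 1) // 2   # [(N+1)/2]
    high = N - cut       # number of ranks with score 1
    pmf = {}
    for l in range(max(0, m - cut), min(m, high) + 1):
        pmf[l] = Fraction(comb(high, l) * comb(cut, m - l), comb(N, m))
    return pmf
```

For n = m = 3 the pmf puts mass 1/20, 9/20, 9/20, 1/20 on l = 0, 1, 2, 3, with mean 3/2 = m(1 - [(N+1)/2]/N).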


1.2 LRS in scale problems 


Suppose

$$X_i \overset{iid}{\sim} F\left(\frac{x - \mu_1}{\sigma_1}\right), \quad i = 1,2,..,n,$$

independently of

$$Y_j \overset{iid}{\sim} F\left(\frac{x - \mu_2}{\sigma_2}\right), \quad j = 1,2,..,m,$$

where $F$ is unknown but known to be continuous. If $\sigma_1 = \sigma_2$ but $\mu_1 \ne \mu_2$, the distribution of the rank vector is not uniform. Hence no exactly or even asymptotically distribution free test can be constructed using $L$. But, if $\sigma_1 = \sigma_2$ and $\mu_1 = \mu_2$, the distribution of the rank vector is uniform and hence exactly distribution free tests can be derived using $L$. However, if $\sigma_1 = \sigma_2$ and $\mu_1 \ne \mu_2$ with $(\mu_1, \mu_2)$ known, an exactly distribution free test can also be derived using $L$, since the observations can be centred at the known locations. Thus in order to get distribution free tests using $L$, we assume $X_i \overset{iid}{\sim} F(x), i = 1,2,..,n$, independently of $Y_j \overset{iid}{\sim} F(x/\sigma), j = 1,2,..,m$, with $\sigma > 0$. That is, we assume that the two distributions have a common location and they differ only in scale. Our objective is to test $H_0 : \sigma = 1$ against all alternatives. As earlier we take the LRS as $L = \sum_{i=n+1}^{N} a(R_i)$, that is, $L$ becomes the sum of scores corresponding to the second sample observations.

As earlier, we need to enumerate the behaviour of ranks under a scale alternative to decide 
an appropriate choice of the score function a(.). 

First of all assume that the observations are positive with probability one, that is, $F(0) = 0$. Then

$$\sigma > 1 \iff F(x) \ge F\left(\frac{x}{\sigma}\right)\ \forall x \iff Y \ge_{st} X$$

and

$$\sigma < 1 \iff F(x) \le F\left(\frac{x}{\sigma}\right)\ \forall x \iff Y \le_{st} X.$$

Thus tests for stochastic alternatives (e.g. the Mann-Whitney test) are all applicable for testing $H_0 : \sigma = 1$ against all alternatives.

Next assume observations on the whole real line together with symmetry, that is, $F(x) + F(-x) = 1\ \forall x$. Then $\sigma > 1$ implies that the second sample observations are expected to have lower concentration than the first sample observations. Naturally the second sample observations are expected to be either too large or too small. Thus second sample ranks are expected to be either too small or too large under $\sigma > 1$ than under $\sigma = 1$. Thus if we take a symmetric U shaped score function, then $a(R_i)$ are expected to be larger under $\sigma > 1$ than under $\sigma = 1$ for every $i \ge n+1$. Hence $L$ is expected to be larger under $\sigma > 1$ than under $\sigma = 1$ and hence a right tailed test is suggested. Similarly $L$ is expected to be smaller under $\sigma < 1$ than under $\sigma = 1$ and hence a left tailed test is suggested. Again for testing $\sigma = 1$ against $\sigma \ne 1$ an equal tailed test based on $|L - E_{H_0}(L)|$ is suggested provided the distribution of $L$ is symmetric under the null hypothesis. One can also use a bell shaped symmetric score function, but in such a case, the rejection regions will be reversed. However, the distribution of $L$ is discrete under the null hypothesis and hence exact size $\alpha$ tests will be, in general, randomized.


1.2.1 Quartile score & test 
Note that for the set $\{1,2,..,N\}$, the first and third quartiles are $[\frac{N}{4}]$ and $[\frac{3N}{4}]$, respectively. In quartile score, we assign zero if $[\frac{N}{4}] < i \le [\frac{3N}{4}]$ and assign unity otherwise. This suggests to define $a(i) = 1 - I([\frac{N}{4}] < i \le [\frac{3N}{4}])$ where $i = 1,2,..,N$. Then using the first principle, it is easy to obtain that, when $N$ is a multiple of 4,

$$P_{H_0}(L = l) = \frac{\binom{2N/4}{l}\binom{N - 2N/4}{m-l}}{\binom{N}{m}}$$

for admissible choices of $l$. As earlier, the large sample tests can be performed as usual.


1.2.2 Mood's score, Ansari-Bradley & Klotz normal score

Mood, Ansari-Bradley and Klotz normal scores correspond to the choices $a(i) = (i - \frac{N+1}{2})^2$, $a(i) = |i - \frac{N+1}{2}|$ and $a(i) = \{\Phi^{-1}(\frac{i}{N+1})\}^2$, respectively. As earlier, summary measures and large sample tests based on the standardised statistics can be constructed.


2 Multiple sample location problem 


We have discussed so far tests for two independent samples. But in practice, we can have more than two populations to compare. Nonparametric methods can also be developed in such a situation. Specifically, we assume that

$$X_{i1}, X_{i2}, .., X_{in_i} \overset{iid}{\sim} F(x - \theta_i), \quad i = 1,2,..,k,$$

independently across samples, for unknown but continuous $F$. Then our objective is to test $H_0 : \theta_1 = \theta_2 = .. = \theta_k$ against all alternatives. Thus we get a nonparametric analogue of one way classified data.


2.1 An intuitive test 


Define $R_i$ as the rank of the $i$ th observation in the combined sample $(X_{11}, .., X_{1n_1}, .., X_{kn_k})$ of $N = \sum_{i=1}^{k} n_i$ observations. Then the distribution of the ranks $R_i, i = 1,2,..,N$ is uniform under the null hypothesis. If the null hypothesis is true then the total sum of ranks, that is, $N(N+1)/2$, is expected to be divided proportionally among the $k$ samples. That is, under $H_0$, the sum of ranks for the $i$ th sample is expected to be $n_i(N+1)/2$. If the sum of the observed ranks assigned to the $i$ th sample is denoted by $R_{i0}$, a reasonable test statistic could be based on the deviations between these observed and expected rank sums. Thus one can use the sum of these squared deviations as a reasonable statistic. In particular, we suggest

$$S = \sum_{i=1}^{k} \{R_{i0} - n_i(N+1)/2\}^2$$

to measure the deviation from the null hypothesis. Naturally a large value of $S$ indicates deviation from the null and hence we reject the null hypothesis if $S$ is too large.


2.2 Kruskal-Wallis Test 


Quite naturally $S$ provides a distribution free test for the null hypothesis. However, a more useful test criterion is a weighted sum of squares of deviations, where the sample size reciprocals are the weights. Following the above argument, Kruskal and Wallis (1952) defined the statistic

$$H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{1}{n_i}\{R_{i0} - n_i(N+1)/2\}^2.$$

As before, a large value of $H$ indicates deviation from the null and hence we reject the null hypothesis if $H$ is too large. However, tests based on $S$ and $H$ are equivalent when the number of observations from each sample is the same. A computationally easier form of $H$ can be expressed as

$$H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_{i0}^2}{n_i} - 3(N+1).$$

As before, being a function of ranks, $H$ provides an exactly distribution free test. In addition, if $\min(n_1,..,n_k) \to \infty$ such that $\frac{n_i}{N} \to \lambda_i \in (0,1)$, then it can be proved that $H \xrightarrow{d} \chi^2_{k-1}$ under the null hypothesis. Thus an asymptotically size $\alpha$ test rejects the null hypothesis if $H > \chi^2_{k-1,\alpha}$. However, for more exposure on multi-sample problems, we refer the reader to the book by Gibbons and Chakraborti (2004).
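The two expressions for $H$ can be compared numerically. The sketch below assumes that all observations in the combined sample are distinct and returns both the defining form and the computational form; the function name is hypothetical.

```python
def kruskal_wallis_h(samples):
    """Return (H from the definition, H from the computational form); no ties assumed."""
    combined = [v for s in samples for v in s]
    N = len(combined)
    # Rank of each (distinct) value in the combined sample.
    rank = {v: sum(w <= v for w in combined) for v in combined}
    rank_sums = [sum(rank[v] for v in s) for s in samples]
    sizes = [len(s) for s in samples]
    factor = 12 / (N * (N + 1))
    h_definition = factor * sum((R - n * (N + 1) / 2) ** 2 / n
                                for R, n in zip(rank_sums, sizes))
    h_computational = factor * sum(R * R / n for R, n in zip(rank_sums, sizes)) - 3 * (N + 1)
    return h_definition, h_computational
```

For the three samples (1.2, 3.4, 2.2), (5.1, 4.8, 6.0) and (0.3, 2.9, 7.7) the rank sums are 10, 21 and 14, and both forms give H = 124/45, roughly 2.76.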


Nonparametric Inference: Module 19

What we provide in this module

• Review of the concept of efficiency in estimation theory.
• Comparison of tests
• Finite sample relative efficiency: The concept and problems
• Asymptotic power: Definition


1 Co-Ordinator: Dr. Rahul Bhattacharya, Department of Statistics, Calcutta University 


1 A review of performance comparison in estimation


1.1 Small sample comparison 


In statistical point estimation, we compare two estimators through some expected distance measure from the parameter of interest. For example, suppose we have two estimators, $T_{1n}$ and $T_{2n}$, based on $n$ observations for the estimation of some real valued parameter $\theta$. Then the relative efficiency can be measured by the ratio of Mean Square Errors (MSE's)

$$\frac{E(T_{1n} - \theta)^2}{E(T_{2n} - \theta)^2},$$

provided they exist. Naturally, the estimator with the smaller MSE is preferable. If the estimators are unbiased, the above ratio reduces to the ratio of variances or the ratio of precisions. Then, in such a situation, the estimator with the smaller variance is the best. However, such a criterion depends heavily on the choice of the distance measure in addition to the existence of second order moments. Thus we need another criterion, known as Pitman Measure of Closeness (PMC), defined by

$$P(T_{1n}, T_{2n} \mid \theta) = P(|T_{1n} - \theta| < |T_{2n} - \theta|).$$

Then, in its simplest form, PMC measures the relative frequency with which $T_{1n}$ is closer than $T_{2n}$ to the unknown parameter of interest $\theta$. However, PMC does not measure the relative distances of $T_{1n}$ and $T_{2n}$ from the true $\theta$. Moreover, calculation of PMC requires the joint sampling distribution of $T_{1n}$ and $T_{2n}$ and hence is not easy to compute for a number of estimators.


1.2 Large sample comparison 


Now assume that we have two consistent estimators $T_{1n}$ and $T_{2n}$ of some real valued parameter $\theta$, that is, $T_{kn} \xrightarrow{p} \theta$ for each $k = 1,2$. Since consistency is a desirable property in any estimation problem, rate of convergence would be a natural measure of relative performance. Thus, the estimator converging to $\theta$ at a higher rate is better. But the problem is that the rate of convergence is not easy to compute theoretically. However, if we assume the existence of MSE, then $MSE(T_{kn}) = E(T_{kn} - \theta)^2 \to 0$ as $n \to \infty$ ensures consistency of $T_{kn}, k = 1,2$. Thus the rate of convergence can be judged from the ordering of the MSE's. That is, the estimator corresponding to the smaller MSE converges at a faster rate to $\theta$ in probability than its competitor. In other words, if $MSE(T_{1n}) < MSE(T_{2n})\ \forall \theta$, $T_{1n}$ is better in terms of the convergence rate. But again, such a comparison involves heavy calculation and hence lacks practical feasibility.

Thus we restrict to the class of estimators which are not only consistent but also asymptotically normal. That is, for every $k = 1,2$,

$$\sqrt{n}(T_{kn} - \theta) \xrightarrow{d} N(0, \sigma_k^2)$$

for some $\sigma_k^2$. Then as before, relative efficiency can be measured by the ratio of asymptotic variances. However, in this case the efficiency is called the Asymptotic Relative Efficiency (ARE). In brief, suppose

$$\sqrt{n}(T_{1n} - \theta) \xrightarrow{d} N(0, \tau^2)$$

and

$$\sqrt{n}(T_{2n'} - \theta) \xrightarrow{d} N(0, \tau^2),$$

where $T_{2n'}$ is based on $n' = n'(n)$ observations. Then ARE of $\{T_{1n}\}$ with respect to $\{T_{2n}\}$ is defined as

$$e_{12} = \lim_{n \to \infty} \frac{n'(n)}{n},$$

provided the limit exists and is independent of the subsequence $n'$. Thus ARE can be interpreted as the relative number of additional observations required by the less efficient estimator to reach the same accuracy. Naturally, $\{T_{1n}\}$ is more (or less) efficient in the asymptotic sense if $e_{12} > 1$ (or $< 1$). However, in actual practice, the limit is not easy to evaluate and we use the equivalent form

$$e_{12} = \frac{\sigma_2^2}{\sigma_1^2}.$$

2 Performance comparison in hypothesis testing 


Suppose we have two competing tests for the same hypothesis. Then it is natural 
to inquire about the relative performance. In estimation theory, we have used some 
distance measure to measure the performance. In hypothesis testing problems, the 
only relevant measure is power, that is, a bounded measure of departure from the value 
specified by the null hypothesis. Thus if the tests are of same level of significance, 
then the test having a power curve lying above the power curves of all the competitors 
would be preferred. However, in nonparametric inference, the underlying distributions 
are never specified and hence construction of optimum tests is almost impossible. 
Therefore, at a minimum, we can compare the power functions for fixed sample size. 
But such a comparison depends on the sample size, the alternative, the form of 
the underlying distribution and the level of significance. Hence reaching a general 
conclusion will be difficult. 

Therefore, as an alternative one can use large sample methods to compare the tests. Although the use of large sample approximation eliminates the dependence on sample size, it creates a more serious problem: we, in general, consider consistent tests, so that the power reaches unity for large sample size. Thus the limiting power function is of no use for the purpose of comparison. Pitman (1948), in this context, suggested comparing asymptotic values of power considering a sequence of alternatives converging to the value provided by the null hypothesis. An appropriate choice of the sequence of alternatives enables a clear comparison.


2.1 Finite sample efficiency 


Suppose $X_1, X_2, \ldots, X_n$ are iid observations from an unknown distribution $G$ and our
interest lies in testing $H_0: G \in \Omega_0$ against $H_a: G \in \Omega_a$, where $\Omega_0$ ($\Omega_a$) is the class of
distributions specified by the null (alternative) hypothesis.

Consider two different level $\alpha$ tests, $\phi^{(1)}$ and $\phi^{(2)}$. That is, for $k = 1, 2$,

\[ E_G\big(\phi^{(k)}\big) \le \alpha \quad \text{for every } G \in \Omega_0. \]

Take an arbitrary $G^* \in \Omega_a$ and a fixed $\beta$ such that $\alpha < \beta < 1$. The relative
efficiency of test 1 with respect to test 2 is the ratio of the sample sizes the two tests
require to attain power at least $\beta$. Precisely, if $N_k = N_k(\alpha, \beta, G^*)$ denotes the minimum number
of observations required by test $k$ to satisfy

\[ E_{G^*}\big(\phi^{(k)}_{N_k}\big) \ge \beta, \quad k = 1, 2, \]

then the fixed sample relative efficiency is defined by the ratio

\[ e_{12}(\alpha, \beta, G^*) = \frac{N_2(\alpha, \beta, G^*)}{N_1(\alpha, \beta, G^*)}. \]

Naturally, test 1 is preferred over test 2 if $e_{12}(\alpha, \beta, G^*) > 1$, and if the opposite holds,
we prefer test 2 to test 1.
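As a rough numerical illustration of the fixed sample relative efficiency, the following sketch searches for the minimum sample sizes $N_1$ and $N_2$. It rests on assumptions that are not part of the text: the data are taken to be $N(\theta, 1)$, a one-sided z test with known unit variance stands in for test 1, the non-randomised sign test (level at most $\alpha$) plays the role of test 2, and all function names are hypothetical.

```python
from math import comb
from statistics import NormalDist

STD_NORMAL = NormalDist()

def z_test_power(n, theta, alpha):
    """Power of the one-sided z test (known unit variance) at mean theta."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    return 1 - STD_NORMAL.cdf(z_alpha - n ** 0.5 * theta)

def sign_test_power(n, theta, alpha):
    """Power of the non-randomised sign test whose level is at most alpha."""
    # Smallest critical value c with P(S >= c) <= alpha when S ~ Bin(n, 1/2).
    tail, c = 0.0, n + 1
    for k in range(n, -1, -1):
        pmf_null = comb(n, k) * 0.5 ** n
        if tail + pmf_null > alpha:
            break
        tail += pmf_null
        c = k
    p = STD_NORMAL.cdf(theta)  # P(X > 0) when X ~ N(theta, 1)
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(c, n + 1))

def min_sample_size(power_fn, theta, alpha, beta, n_max=500):
    """Smallest n <= n_max with power at least beta (None if none is found)."""
    for n in range(1, n_max + 1):
        if power_fn(n, theta, alpha) >= beta:
            return n
    return None

alpha, beta, theta_star = 0.05, 0.80, 0.5  # illustrative choices only
n1 = min_sample_size(z_test_power, theta_star, alpha, beta)
n2 = min_sample_size(sign_test_power, theta_star, alpha, beta)
e12 = n2 / n1  # fixed sample relative efficiency of test 1 with respect to test 2
print(n1, n2, e12)
```

Changing `alpha`, `beta` or `theta_star` changes the ratio, which is exactly the dependence on $(\alpha, \beta, G^*)$ that makes the fixed sample comparison hard to summarise.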


2.2 Asymptotic Power(AP) 


Now $e_{12}(\alpha, \beta, G^*)$ depends on $\alpha$, $\beta$ and $G^*$ and hence, as discussed earlier, makes
the comparison difficult. Thus, before defining the asymptotic relative efficiency, we
discuss the concepts of local alternatives and asymptotic power. Assume that the
underlying distributions can be indexed by a real parameter $\theta$ so that the problem
reduces to testing $H_0: \theta \in \Omega_0$ against $H_a: \theta \in \Omega_a$. Then a sequence of alternatives
$\{\theta_n\}$ is called Pitman's local alternative if

(i) $\theta_n \in \Omega_a$ for all $n \ge 1$, and

(ii) $\theta_n \to \theta_0 \in \Omega_0$ as $n \to \infty$.

Now consider a sequence of tests $\{\phi_n\}$ for testing $H_0: \theta \in \Omega_0$ against $H_a: \theta \in \Omega_a$.
Then $\{\phi_n\}$ is asymptotically size $\alpha$ if

\[ E_\theta(\phi_n) \to \alpha \quad \text{for all } \theta \in \Omega_0. \]

Then the AP of $\{\phi_n\}$ against the local alternatives $\{\theta_n\}$ is defined as $\lim_{n \to \infty} E_{\theta_n}(\phi_n)$,
provided the limit exists and lies in $(\alpha, 1)$.

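However, determination of AP is not straightforward and hence we provide below an
equivalent expression under some assumptions.

Consider a sequence of asymptotically size $\alpha$ right tailed tests $\{\phi_n\}$ based on $\{T_n\}$
for testing $H_0: \theta \in \Omega_0$ against $H_a: \theta \in \Omega_a$. That is,

\[ \phi_n = \begin{cases} 1 & \text{if } T_n > c_n \\ 0 & \text{if } T_n \le c_n \end{cases} \]

where $\{c_n\}$ are such that

\[ E_\theta(\phi_n) \to \alpha \quad \text{for all } \theta \in \Omega_0. \]

Then the AP of $\{\phi_n\}$ against $\{\theta_n\}$ is given by $\beta = \lim_{n \to \infty} E_{\theta_n}(\phi_n) \in (\alpha, 1)$, provided
the limit exists.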


Nonparametric Inference: Module 20: 


What we provide in this module 


• Asymptotic Power (AP): the determination
• Asymptotic relative efficiency: definition and working formula
• A few illustrations with interpretation


1 Co-Ordinator: Dr. Rahul Bhattacharya, Department of Statistics, Calcutta University 


1 Asymptotic Power(AP)-The determination 


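For the determination of AP, we need the following regularity conditions:

A1. Suppose $\theta_0 \in \Omega_0$ and $\theta \in \Omega_a$ are such that $\theta > \theta_0$. Then we take $\theta_n = \theta_0 + b/\sqrt{n}$, $b > 0$,
so that $\theta_n \in \Omega_a \ \forall n$ and $\theta_n \to \theta_0 \in \Omega_0$ as $n \to \infty$.

A2. Suppose there exist $\mu_{T_n}(\theta)$ and $\sigma_{T_n}(\theta) > 0$ such that under $\theta_0$,

\[ \frac{T_n - \mu_{T_n}(\theta_0)}{\sigma_{T_n}(\theta_0)} \xrightarrow{d} W, \]

and under $\{\theta_n\}$,

\[ \frac{T_n - \mu_{T_n}(\theta_n)}{\sigma_{T_n}(\theta_n)} \xrightarrow{d} W, \]

for some known continuous random variable $W$ with distribution function $H(w) = P(W \le w)$.

A3. $\frac{d}{d\theta}\mu_{T_n}(\theta)$ exists and is nonzero and continuous in the neighbourhood of $\theta_0$.

A4. There exists a positive constant $d$ such that

\[ \lim_{n \to \infty} \frac{1}{\sqrt{n}} \frac{\mu'_{T_n}(\theta_0)}{\sigma_{T_n}(\theta_0)} = d, \]

with $\mu'_{T_n}(\theta_0) = \frac{d}{d\theta}\mu_{T_n}(\theta)\big|_{\theta = \theta_0}$.

A5. $\lim_{n \to \infty} \frac{\sigma_{T_n}(\theta_n)}{\sigma_{T_n}(\theta_0)} = 1$.

Result 1: Under the above conditions, the asymptotic power (AP) of a test is $P(W > w_\alpha - bd)$,
where $w_\alpha$ is such that $P(W > w_\alpha) = \alpha$.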


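Proof: Since $\{\phi_n\}$ is asymptotically size $\alpha$, we get

\[
\alpha = \lim_{n \to \infty} P_{\theta_0}(T_n > c_n)
= \lim_{n \to \infty} P_{\theta_0}\left( \frac{T_n - \mu_{T_n}(\theta_0)}{\sigma_{T_n}(\theta_0)} > \frac{c_n - \mu_{T_n}(\theta_0)}{\sigma_{T_n}(\theta_0)} \right)
= P(W > w_\alpha) \quad \text{using A2},
\]

where $w_\alpha$ is such that $H(w_\alpha) = 1 - \alpha$. Thus, we get

\[ \frac{c_n - \mu_{T_n}(\theta_0)}{\sigma_{T_n}(\theta_0)} \to w_\alpha \qquad (*) \]

as $n \to \infty$. Now the AP of $\{\phi_n\}$ against $\{\theta_n\}$ is given by

\[
AP = \lim_{n \to \infty} P_{\theta_n}(T_n > c_n)
= \lim_{n \to \infty} P_{\theta_n}\left( \frac{T_n - \mu_{T_n}(\theta_n)}{\sigma_{T_n}(\theta_n)} > \frac{c_n - \mu_{T_n}(\theta_n)}{\sigma_{T_n}(\theta_n)} \right).
\]

Since

\[
\frac{c_n - \mu_{T_n}(\theta_n)}{\sigma_{T_n}(\theta_n)}
= \frac{c_n - \mu_{T_n}(\theta_0)}{\sigma_{T_n}(\theta_0)} \cdot \frac{\sigma_{T_n}(\theta_0)}{\sigma_{T_n}(\theta_n)}
- \frac{\mu_{T_n}(\theta_n) - \mu_{T_n}(\theta_0)}{(\theta_n - \theta_0)\,\sigma_{T_n}(\theta_0)} \cdot (\theta_n - \theta_0) \cdot \frac{\sigma_{T_n}(\theta_0)}{\sigma_{T_n}(\theta_n)}
\]
\[
= \frac{c_n - \mu_{T_n}(\theta_0)}{\sigma_{T_n}(\theta_0)} \cdot \frac{\sigma_{T_n}(\theta_0)}{\sigma_{T_n}(\theta_n)}
- b \, \frac{\mu_{T_n}(\theta_n) - \mu_{T_n}(\theta_0)}{\sqrt{n}\,\sigma_{T_n}(\theta_0)\,(\theta_n - \theta_0)} \cdot \frac{\sigma_{T_n}(\theta_0)}{\sigma_{T_n}(\theta_n)}
\to w_\alpha - bd \quad \text{as } n \to \infty
\]

(using A1, A3-A5 and (*)), we have $AP = P(W > w_\alpha - bd)$. Hence the proof.
It is worth mentioning that $AP \in (\alpha, 1)$.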
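To make Result 1 concrete, the sketch below evaluates $P(W > w_\alpha - bd)$ in the special case where $W$ is standard normal. The normality of $W$ and the function name `asymptotic_power` are assumptions made only for this illustration; Result 1 itself allows any known continuous $W$.

```python
from statistics import NormalDist

def asymptotic_power(alpha, b, d):
    """P(W > w_alpha - b*d) for W ~ N(0, 1), following the form of Result 1."""
    w = NormalDist()
    w_alpha = w.inv_cdf(1 - alpha)  # upper alpha point: P(W > w_alpha) = alpha
    return 1 - w.cdf(w_alpha - b * d)

# With b = 0 the alternative coincides with the null and the AP reduces to alpha.
print(asymptotic_power(0.05, 0.0, 1.0))
# For b > 0 and d > 0 the AP lies strictly between alpha and 1.
print(asymptotic_power(0.05, 2.0, 1.0))
```

The AP depends on the test only through $d$, which is why the comparison of two tests in the next section reduces to comparing their values of $d$.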


2 Asymptotic Relative Efficiency (ARE)


2.1 The definition 


Now consider the testing problem $H_0: \theta \in \Omega_0$ against $H_a: \theta \in \Omega_a$. Suppose we have two
tests, $\phi^{(1)}$ based on $\{T_{1n}\}$ and $\phi^{(2)}$ based on $\{T_{2n}\}$. Both the tests are assumed to be right
tailed and asymptotically size $\alpha$. Let $\{k(n)\}$ and $\{k'(n)\}$ be two sequences of positive integers
such that

(i) $\{k(n)\}$ and $\{k'(n)\}$ are both increasing in $n$, and

(ii) as $n \to \infty$, $k(n) \to \infty$ and $k'(n) \to \infty$.

Again these sequences are such that

\[ \alpha < \lim_{n \to \infty} P_{\theta_n}\big(T_{1k(n)} > c_{1k(n)}\big) = \lim_{n \to \infty} P_{\theta_n}\big(T_{2k'(n)} > c_{2k'(n)}\big) < 1. \]

Then the ARE of $\phi^{(1)}$ relative to $\phi^{(2)}$ is defined by

\[ ARE(1|2) = \lim_{n \to \infty} \frac{k'(n)}{k(n)}, \]

provided the limit exists and is independent of the sequences $\{k(n)\}$ and $\{k'(n)\}$ satisfying
(i) and (ii) above.

Thus ARE is the limit of the ratio of the sample sizes required to reach the same
limiting power against the identical sequence of alternatives for tests with the same limiting
significance levels. Then $ARE(1|2) = 2$ means that we need approximately twice as many
observations for test 2 as compared to test 1 to reach the same limiting power for the same
sequence of alternatives. Naturally, test 1 is better or worse than test 2 in the limit according
as $ARE(1|2) > 1$ or $< 1$.


2.2 The determination


However, the definition of ARE given above is not convenient for computation in practice. Thus we
need an equivalent expression so that the efficiency can be evaluated easily. The following
result gives such an expression.


Result 2: If the tests $\phi^{(1)}$ and $\phi^{(2)}$ satisfy the regularity conditions (A1)-(A5), then

\[
ARE(1|2) = \frac{\lim_{n \to \infty} \left\{ \frac{1}{\sqrt{n}} \frac{\mu'_{T_{1n}}(\theta_0)}{\sigma_{T_{1n}}(\theta_0)} \right\}^2}{\lim_{n \to \infty} \left\{ \frac{1}{\sqrt{n}} \frac{\mu'_{T_{2n}}(\theta_0)}{\sigma_{T_{2n}}(\theta_0)} \right\}^2}.
\]


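Proof: From the definition, the sequences $\{k(n)\}$ and $\{k'(n)\}$ are such that

\[ \lim_{n \to \infty} P_{\theta_n}\big(T_{1k(n)} > c_{1k(n)}\big) = \lim_{n \to \infty} P_{\theta_n}\big(T_{2k'(n)} > c_{2k'(n)}\big). \]

Since $\phi^{(1)}$ and $\phi^{(2)}$ satisfy the regularity conditions (A1)-(A5), $\lim_{n \to \infty} P_{\theta_n}(T_{1k(n)} > c_{1k(n)})$
is nothing but the AP of $\phi^{(1)}$. Then we have from Result 1 that

\[ \lim_{n \to \infty} P_{\theta_n}\big(T_{1k(n)} > c_{1k(n)}\big) = P(W > w_\alpha - bd_1), \qquad d_1 = \lim_{n \to \infty} \frac{1}{\sqrt{n}} \frac{\mu'_{T_{1k(n)}}(\theta_0)}{\sigma_{T_{1k(n)}}(\theta_0)}. \]

Similarly, for $\phi^{(2)}$ we get the expression of AP as

\[ \lim_{n \to \infty} P_{\theta_n}\big(T_{2k'(n)} > c_{2k'(n)}\big) = P(W > w_\alpha - bd_2), \qquad d_2 = \lim_{n \to \infty} \frac{1}{\sqrt{n}} \frac{\mu'_{T_{2k'(n)}}(\theta_0)}{\sigma_{T_{2k'(n)}}(\theta_0)}. \]

Thus we require $\{k(n)\}$ and $\{k'(n)\}$ satisfying

\[ \lim_{n \to \infty} P_{\theta_n}\big(T_{1k(n)} > c_{1k(n)}\big) = \lim_{n \to \infty} P_{\theta_n}\big(T_{2k'(n)} > c_{2k'(n)}\big) \]
\[ \Leftrightarrow P(W > w_\alpha - bd_1) = P(W > w_\alpha - bd_2) \]
\[ \Leftrightarrow \lim_{n \to \infty} \frac{1}{\sqrt{n}} \frac{\mu'_{T_{1k(n)}}(\theta_0)}{\sigma_{T_{1k(n)}}(\theta_0)} = \lim_{n \to \infty} \frac{1}{\sqrt{n}} \frac{\mu'_{T_{2k'(n)}}(\theta_0)}{\sigma_{T_{2k'(n)}}(\theta_0)} \]
\[ \Leftrightarrow \lim_{n \to \infty} \sqrt{\frac{k(n)}{n}} \cdot \frac{1}{\sqrt{k(n)}} \frac{\mu'_{T_{1k(n)}}(\theta_0)}{\sigma_{T_{1k(n)}}(\theta_0)} = \lim_{n \to \infty} \sqrt{\frac{k'(n)}{n}} \cdot \frac{1}{\sqrt{k'(n)}} \frac{\mu'_{T_{2k'(n)}}(\theta_0)}{\sigma_{T_{2k'(n)}}(\theta_0)} \]
\[ \Leftrightarrow \lim_{n \to \infty} \frac{k'(n)}{k(n)} = \frac{\lim_{n \to \infty} \left\{ \frac{1}{\sqrt{n}} \frac{\mu'_{T_{1n}}(\theta_0)}{\sigma_{T_{1n}}(\theta_0)} \right\}^2}{\lim_{n \to \infty} \left\{ \frac{1}{\sqrt{n}} \frac{\mu'_{T_{2n}}(\theta_0)}{\sigma_{T_{2n}}(\theta_0)} \right\}^2}. \]

This completes the proof.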


Remark 1: From the expression of ARE, we observe that under the regularity conditions,
the ARE can be interpreted as the ratio of two efficacy measures, where the efficacy of the test based
on $T_n$ is defined as

\[ e(T_n) = \lim_{n \to \infty} \left\{ \frac{1}{\sqrt{n}} \frac{\mu'_{T_n}(\theta_0)}{\sigma_{T_n}(\theta_0)} \right\}^2. \]

Remark 2: The expression of ARE remains valid even for two sided alternatives.

Remark 3: The ARE does not depend on $\alpha$ and $\beta$ as long as the regularity conditions are
satisfied. Thus, unlike the fixed sample efficiency, the ARE does not suffer from these disadvantages.


2.3 Examples 


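Now we shall provide some examples on finding the ARE in some testing problems. Suppose
$X_i, i = 1, 2, \ldots, n$ are iid observations from $F(x - \theta)$, $-\infty < \theta < \infty$, where $F(x) + F(-x) = 1 \ \forall x$.
Consider testing $H_0: \theta = 0$ against $H_a: \theta > 0$. Now we have the following tests:

• Test 1: Consider a test based on the t statistic $T_{1n} = \sqrt{n}\bar{X}/s_n$, which rejects the null
hypothesis if $T_{1n} > c_{1n}$, where $(n - 1)s_n^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2$.

• Test 2: Consider the Sign test based on the statistic $T_{2n} = \frac{1}{n}\sum_{i=1}^{n} I(X_i > 0)$, which rejects
the null hypothesis if $T_{2n} > c_{2n}$.

• Test 3: Consider the Wilcoxon signed rank test based on $T_{3n} = \binom{n}{2}^{-1} \sum_{1 \le i \le j \le n} I\left(\frac{X_i + X_j}{2} > 0\right)$, which
rejects the null hypothesis if $T_{3n} > c_{3n}$.

Now consider the sequence of local alternatives $\theta_n = b/\sqrt{n}$ and assume $\sigma^2 = Var_\theta(X_1) < \infty$.
First of all, we verify the conditions (A2)-(A5). Since $s_n \xrightarrow{p} \sigma$, $T_{1n}$ and $\sqrt{n}\bar{X}/\sigma$ have the
same asymptotic distribution. This suggests defining $\mu_{T_{1n}}(\theta) = \sqrt{n}\theta/\sigma$ and $\sigma_{T_{1n}}(\theta) = 1$.
Then conditions (A3)-(A5) are all satisfied with $d_1 = \sigma^{-1}$. Now applying the Central Limit
Theorem (CLT), it can be shown that

\[ \frac{T_{1n} - \mu_{T_{1n}}(\theta)}{\sigma_{T_{1n}}(\theta)} \xrightarrow{d} N(0, 1), \]

under both $\theta = 0$ and $\theta = \theta_n$. Thus (A2) is also satisfied. Hence the efficacy of Test 1 is

\[ e(T_{1n}) = \lim_{n \to \infty} \left\{ \frac{1}{\sqrt{n}} \frac{\mu'_{T_{1n}}(0)}{\sigma_{T_{1n}}(0)} \right\}^2 = \lim_{n \to \infty} \left\{ \frac{1}{\sqrt{n}} \cdot \frac{\sqrt{n}}{\sigma} \right\}^2 = \frac{1}{\sigma^2}. \]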

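Next consider the Sign test statistic. Then $E_\theta(T_{2n}) = P_\theta(X_1 > 0) = F(\theta)$ and
$Var_\theta(T_{2n}) = F(\theta)(1 - F(\theta))/n$. Since, by the DeMoivre-Laplace limit theorem,

\[ \frac{T_{2n} - F(\theta)}{\sqrt{F(\theta)(1 - F(\theta))/n}} \xrightarrow{d} N(0, 1), \]

under both $\theta = 0$ and $\theta = \theta_n$, we define $\mu_{T_{2n}}(\theta) = F(\theta)$ and $\sigma_{T_{2n}}(\theta) = \sqrt{F(\theta)(1 - F(\theta))/n}$. Thus
conditions (A2)-(A5) are all satisfied with $d_2 = 2f(0)$, where $f(x) = \frac{d}{dx}F(x)$ exists at all $x$ and is
continuous with $f(0) > 0$. Hence the efficacy of Test 2 is

\[ e(T_{2n}) = \lim_{n \to \infty} \left\{ \frac{1}{\sqrt{n}} \frac{\mu'_{T_{2n}}(0)}{\sigma_{T_{2n}}(0)} \right\}^2 = \lim_{n \to \infty} \left\{ \frac{1}{\sqrt{n}} \cdot \frac{f(0)}{1/(2\sqrt{n})} \right\}^2 = 4f^2(0). \]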


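Now consider the Wilcoxon Signed rank test statistic. Then $T_{3n}$ can be written as

\[ T_{3n} = \binom{n}{2}^{-1} \left[ \sum_{i=1}^{n} I(X_i > 0) + \sum_{i<j} I(X_i + X_j > 0) \right] = \frac{2}{n-1} U_1 + U_2, \]

where $U_1$ and $U_2$ are U statistics corresponding to the kernels $I(X_1 > 0)$ and $I(X_1 + X_2 > 0)$,
respectively. Then from the theory of U statistics, it follows that $\sqrt{n}(T_{3n} - U_2) \xrightarrow{p} 0$ as $n \to \infty$.
That is, $T_{3n}$ and $U_2$ have the same asymptotic distribution. Now from the theory of U
statistics it follows that as $n \to \infty$,

\[ \frac{U_2 - E_\theta(U_2)}{\sqrt{Var_\theta(U_2)}} \xrightarrow{d} N(0, 1), \]

under both $\theta = 0$ and $\theta = \theta_n$. Thus we set $\mu_{T_{3n}}(\theta) = E_\theta(U_2) = P_\theta(X_1 + X_2 > 0) = \int F(x + 2\theta)\,dF(x)$
and $\sigma^2_{T_{3n}}(\theta) = Var_\theta(U_2)$. Thus conditions (A2)-(A5)
are all satisfied. Now assume that $f(x) = \frac{d}{dx}F(x)$ exists and is continuous at all $x$. In
addition, assume that differentiation under the sign of integration is valid. Then we get
$\mu'_{T_{3n}}(0) = 2\int_{-\infty}^{\infty} f^2(x)\,dx$, assuming $\int_{-\infty}^{\infty} f^2(x)\,dx < \infty$. Again, under $\theta = 0$, $F(X_1)$ has a
$R(0, 1)$ distribution and hence we get, to leading order, $\sigma_{T_{3n}}(0) = 1/\sqrt{3n}$. Thus we get the efficacy of Test 3 as

\[ e(T_{3n}) = \lim_{n \to \infty} \left\{ \frac{1}{\sqrt{n}} \frac{\mu'_{T_{3n}}(0)}{\sigma_{T_{3n}}(0)} \right\}^2 = \left( 2\sqrt{3} \int_{-\infty}^{\infty} f^2(x)\,dx \right)^2 = 12 \left( \int_{-\infty}^{\infty} f^2(x)\,dx \right)^2. \]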


Now we can compare different pairs of tests using ARE. First of all, consider comparing the Sign
test and the test based on the t statistic. The ARE takes the form

\[ ARE(Sign|t) = \frac{e(T_{2n})}{e(T_{1n})} = 4\sigma^2 f^2(0). \]

Naturally, ARE depends on the underlying density $f$. Thus we provide below the values
corresponding to different distributions.

Table 1: Comparing Sign Test and test based on t statistic

f(x)                    | ARE(Sign|t)
N(0,1)                  | 2/π = .64
Logistic(0,1)           | π²/12 = .82
Double Exponential(0,1) | 2.0
R(-1,1)                 | 1/3 = .33
Cauchy(0,1)             | ∞

Thus we observe that, among the finite variance parents in the table, the Sign test is less efficient
than the t statistic based test except for the double exponential distribution. Roughly speaking,
for normal parents, we would require 36% fewer observations for the t test compared to the Sign
test to get the same performance. This is not unexpected as the sign test uses only the
information about the signs of the differences. However, for a heavy tailed distribution like the
double exponential, the quality of information used by the sign test is improved and hence it
becomes twice as efficient as the test based on the t statistic. This shows that the t statistic based
test is no longer desired for double exponential distributions. For information, we note that
within the class of all continuous unimodal symmetric densities, ARE(Sign|t) cannot be
lower than 1/3 (see Hodges and Lehmann, 1956, for details). From the above table, we find
that the only distribution attaining the lower bound is $R(-1, 1)$.
Now consider comparing the Wilcoxon Signed rank test and the test based on the t statistic. Then we
have

\[ ARE(SignedRank|t) = \frac{e(T_{3n})}{e(T_{1n})} = 12\sigma^2 \left( \int_{-\infty}^{\infty} f^2(x)\,dx \right)^2. \]

As before, to learn the performance of the competing tests, we compute ARE for different
underlying densities $f$. All these are provided in the table below.

Table 2: Comparing Signed rank Test and test based on t statistic

f(x)                    | ARE(SignedRank|t)
N(0,1)                  | 3/π = .955
Logistic(0,1)           | π²/9 = 1.10
Double Exponential(0,1) | 1.5
R(-1,1)                 | 1.0
Cauchy(0,1)             | ∞

We observe that the performances of the Wilcoxon Signed rank test and the t statistic based test are more or less
equivalent for normal, logistic and uniform parents. The improved efficiency relative to the Sign test is expected as
the Wilcoxon Signed rank statistic utilizes both the sign and magnitude of the observations.
It is further observed that for most of the distributions, the Wilcoxon signed rank test is almost
as efficient as, or more efficient than, the t statistic based test. Again the t statistic based test is no
longer desired for double exponential distributions and hence for such parents the efficiency
of the Wilcoxon Signed rank test is significantly higher than that of its competitor. Moreover, in a
seminal work, Hodges and Lehmann (1956) proved that for every continuous density $f(x)$,
ARE(SignedRank|t) is at least .864.

Lastly, we consider comparing two completely nonparametric tests, namely, the Sign test
and the Wilcoxon Signed rank test. Naturally, we arrive at the following expression of ARE:

\[ ARE(Sign|SignedRank) = \frac{f^2(0)}{3\left( \int_{-\infty}^{\infty} f^2(x)\,dx \right)^2}. \]

It is interesting to note that

\[ ARE(Sign|SignedRank) = ARE(Sign|t) \times ARE(t|SignedRank) = \frac{ARE(Sign|t)}{ARE(SignedRank|t)}. \]

Thus we immediately get the following table for different choices of $f$.

Table 3: Comparing Sign Test and Wilcoxon Signed Rank Test

f(x)                    | ARE(Sign|SignedRank)
N(0,1)                  | 2/3 = .67
Logistic(0,1)           | 3/4 = .75
Double Exponential(0,1) | 1.33
R(-1,1)                 | .33

The results are in agreement with our anticipation. Except for the double exponential distribution, the Sign
test is less efficient than the Wilcoxon Signed rank test. We note that the Sign test statistic only
uses the sign of the differences whereas the Wilcoxon Signed rank statistic utilizes both the sign
and magnitude of the observations. The use of this additional information makes the latter more
efficient.
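The three ARE expressions can be evaluated directly once $\sigma^2$, $f(0)$ and $\int f^2(x)\,dx$ are known for the parent density. The sketch below does this for four of the parents discussed above, using standard closed-form values of these quantities; the dictionary `PARENTS` and the function names are hypothetical and serve only as an illustration.

```python
from math import pi

# For each parent density f: (variance sigma^2, f(0), integral of f(x)^2 dx).
PARENTS = {
    "N(0,1)": (1.0, 1 / (2 * pi) ** 0.5, 1 / (2 * pi ** 0.5)),
    "Logistic(0,1)": (pi ** 2 / 3, 1 / 4, 1 / 6),
    "Double Exponential(0,1)": (2.0, 1 / 2, 1 / 4),
    "R(-1,1)": (1 / 3, 1 / 2, 1 / 2),
}

def are_sign_t(sigma2, f0, int_f2):
    """ARE(Sign|t) = 4 sigma^2 f(0)^2."""
    return 4 * sigma2 * f0 ** 2

def are_signed_rank_t(sigma2, f0, int_f2):
    """ARE(SignedRank|t) = 12 sigma^2 (integral of f^2)^2."""
    return 12 * sigma2 * int_f2 ** 2

def are_sign_signed_rank(sigma2, f0, int_f2):
    """ARE(Sign|SignedRank) = f(0)^2 / (3 (integral of f^2)^2)."""
    return f0 ** 2 / (3 * int_f2 ** 2)

for name, params in PARENTS.items():
    print(name,
          round(are_sign_t(*params), 3),
          round(are_signed_rank_t(*params), 3),
          round(are_sign_signed_rank(*params), 3))
```

The Cauchy parent is left out because its variance is infinite, so the first two expressions are infinite rather than computable numbers.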


3 Finally


We have thus derived the theoretical expression of asymptotic power and consequently
asymptotic relative efficiency. Although we have restricted ourselves to univariate distributions,
ARE can also be computed for paired and two sample cases. Along the same
line of thinking, the ARE between the Mann-Whitney test and the test based on the t statistic can
be computed. However, we leave these as an exercise for the reader. In addition, we
have discussed so far only Pitman's notion of ARE but there are alternative developments
by Hodges and Lehmann (1956) and Bahadur (1960). We, therefore, suggest that the interested
reader go through the book by Gibbons and Chakraborti (2004) and the references therein
to get more exposure in this context.


Large sample Inference: Module 7


What we provide in this module


• Uniform convergence
• Examples
• A mathematical definition of uniform convergence with examples




1 Uniform convergence 


We have already discussed pointwise convergence of a sequence of functions. It is seen that
for pointwise convergence, it is necessary to get a feasible $\nu(x, \epsilon)$. So the natural question is
whether we can find a $\nu' = \nu'(\epsilon)$ depending only on $\epsilon$ such that the convergence holds. This
leads to the concept of uniform convergence.

We explain the concept through an example. Consider $f_n(x) = x^n/n^2$, $x \in (-1, 1)$. Then
it is already shown that $f_n(x) \to 0$ pointwise on $(-1, 1)$. One can recall that, for the above
example, we get a choice $\nu(x, \epsilon) = [1/\sqrt{\epsilon}] + 1$. Naturally such a choice is independent
of $x$. Thus we get a common value of $\nu(x, \epsilon)$ ensuring pointwise convergence. Hence the
convergence is more than pointwise and is termed uniform (in the sense that a uniform choice
of $\nu(x, \epsilon)$ exists).


1.1 Uniform convergence-Definition 


Suppose $\{f_n(x)\}$ is a sequence of functions on $A$ to $R$. Then $\{f_n(x)\}$ is said to converge
uniformly to $f$ on $A$ if for any given $\epsilon > 0$, there exists an integer $\nu' = \nu'(\epsilon)$ such that for all
$x \in A$ and $n \ge \nu'$,

\[ |f_n(x) - f(x)| < \epsilon \]

is satisfied. One can use the symbol $f_n \rightrightarrows f$ to indicate uniform convergence on $A$. It is easy
to observe that uniform convergence implies pointwise convergence and uniform convergence
on $A$ implies that on $B$ where $B \subset A$.

Now we shall discuss such a concept graphically. Suppose $\{f_n(x)\}$ is a sequence of functions
on $A$ to $R$ such that $f_n \rightrightarrows f$ on $A$. Then the uniform convergence $f_n \rightrightarrows f$ can be thought of as
the existence of some $\nu'$, independent of $x$, such that the graph of $f_n$ lies inside the band
$(f - \epsilon, f + \epsilon)$ for every $n \ge \nu'$.


Figure 1: Uniform Convergence


1.2 Uniform Convergence: How to check in practice? 


Suppose that, using some intuitive method, a limit function $f$ is decided and pointwise convergence
is established. Also suppose we get some $\nu(x, \epsilon)$ for $\epsilon > 0$ and $x \in A$. Consider $a, b \in A$;
then for a given $\epsilon > 0$ we get real numbers $\nu(a, \epsilon)$ and $\nu(b, \epsilon)$ such that $|f_n(a) - f(a)| < \epsilon \ \forall \ n \ge \nu(a, \epsilon)$
and $|f_n(b) - f(b)| < \epsilon \ \forall \ n \ge \nu(b, \epsilon)$ are satisfied. Naturally, both the
inequalities are satisfied if $n \ge \max(\nu(a, \epsilon), \nu(b, \epsilon))$. The above suggests that if a feasible
$\nu'(\epsilon) = \sup_{x \in B} \nu(x, \epsilon)$ exists for some $B \subset A$, then $f_n \rightrightarrows f$ on $B$.


1.3 Examples 


Example 1: Consider $f_n(x) = x^2/n$, $x \in R$. Then for fixed $x$, we find that $\lim f_n(x) = 0$
and hence we claim $f_n(x) \to 0 \ \forall \ x \in R$. In our previous module, we have determined $\nu(x, \epsilon) = [x^2/\epsilon] + 1$.
Naturally $\sup_{x \in R} \nu(x, \epsilon)$ is not finite but, for example, $\sup_{x \in (0, a)} \nu(x, \epsilon) = [a^2/\epsilon] + 1$
exists for any finite $a > 0$. Thus $f_n \rightrightarrows 0$ on $(0, a)$ whereas $f_n \to 0$ only pointwise for $x \in R$.


Example 2: Consider $f_n(x) = x^n$, $x \in (0, 1)$. Then using the methods previously
discussed we find that $\lim f_n(x) = f(x)$ pointwise on $(0, 1)$ for $f(x) = 0 \ \forall \ x \in (0, 1)$. The
choice $\nu(x, \epsilon) = [\log(\epsilon)/\log(x)] + 1$ was derived. Now $\lim_{x \to 1-} \nu(x, \epsilon) = +\infty$ and hence the
convergence is not uniform on $(0, 1)$. However, $\nu(x, \epsilon)$ is increasing in $x$ and hence if $0 < b < 1$,
then $\sup_{x \in (0, b]} \nu(x, \epsilon) = \nu(b, \epsilon)$ is finite. Therefore, the convergence is uniform over
$(0, b)$, $b < 1$.


Example 3: Consider $f_n(x) = x^n/n^2$, $x \in (-1, 1)$. Then we have seen that pointwise
convergence to $f(x) = 0 \ \forall \ x \in (-1, 1)$ is satisfied whenever $n \ge [1/\sqrt{\epsilon}] + 1$. Thus, in this case
$\nu(x, \epsilon) = [1/\sqrt{\epsilon}] + 1$, which is independent of $x$. Hence the convergence is uniform over
$(-1, 1)$. Thus we observe from the previous examples that pointwise convergence on $A$ does not
necessarily imply uniform convergence on $A$. But we can get a $B \subset A$ on which convergence
is uniform.


Example 4: Consider $f_n(x) = nx/(x + n)$, $x > 0$. Then it is already shown that
$f_n(x) \to x \ \forall \ x > 0$ pointwise. Since $|f_n(x) - x| = x^2/(x + n) < x^2/n$, a feasible choice is $\nu(x, \epsilon) = [x^2/\epsilon] + 1$.
Since $\sup_{x > 0} x^2/(x + n)$ is not finite for any $n$, the convergence cannot be uniform over $(0, \infty)$. However,
$x^2$ is an increasing function on $(0, \infty)$ and hence $\sup_{x \in B} \nu(x, \epsilon)$ is finite for $B = [a, b)$ with $a > 0$ and finite $b$.
Hence the convergence is uniform over $[a, b)$ for some positive $a$ and finite $b$.


Example 5: Consider $f_n(x) = nx/(x^2 + n^2)$, $x > 0$. Then it is shown that $f_n \to f$
for $f(x) = 0 \ \forall \ x > 0$ pointwise on $(0, \infty)$. Since $|f_n(x)| < x/n$, a feasible choice is
$\nu(x, \epsilon) = [x/\epsilon] + 1$. Then, just as in the previous example, it can be shown that the convergence
is uniform over $[a, b)$ for some positive $a$ and finite $b$, but not over $(0, \infty)$, because $f_n(n) = 1/2$ for every $n$. Similarly, uniform convergence for the other
examples can also be investigated.
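The role of $\sup_x \nu(x, \epsilon)$ can be seen numerically for $f_n(x) = x^n$ on $(0, 1)$, using the choice $\nu(x, \epsilon) = [\log(\epsilon)/\log(x)] + 1$ from Example 2. The sketch below is only an illustration; the function name `nu` and the chosen values of $x$ and $\epsilon$ are assumptions.

```python
import math

def nu(x, eps):
    """nu(x, eps) = [log(eps)/log(x)] + 1 for f_n(x) = x**n with 0 < x < 1."""
    return math.floor(math.log(eps) / math.log(x)) + 1

eps = 0.01
for x in (0.5, 0.9, 0.99, 0.999):
    n0 = nu(x, eps)
    # From n0 onwards the term x**n stays below eps at this particular x.
    print(x, n0, x ** n0 < eps)
```

The required index grows without bound as $x$ approaches 1, so no single $\nu'(\epsilon)$ works on the whole of $(0, 1)$, while on $(0, b]$ with $b < 1$ the value `nu(b, eps)` serves for every $x$.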


1.4 Uniform convergence-A mathematical definition 


We have discussed so far uniform convergence from first principles. However, such a definition
is not always easy to check and hence we need a simpler condition. Suppose $\{f_n(x)\}$ is
a sequence of functions on $A$ to $R$ which converges pointwise to some $f(x)$ on $A$. Then
$f_n(x) \rightrightarrows f(x)$ on $B \subset A$ iff

\[ M_n = \sup_{x \in B} |f_n(x) - f(x)| \to 0 \]

as $n \to \infty$.

Consequently, we have the following results on sum and product functions.

The first result is on the uniform convergence of a sequence of functions, where the n th
term is a sum of two. Formally, if $f_n(x) \rightrightarrows f(x)$ and $g_n(x) \rightrightarrows g(x)$ on $A$ then

(i) $c f_n(x) \rightrightarrows c f(x)$ on $A$ for any constant $c$, and

(ii) $f_n(x) + g_n(x) \rightrightarrows f(x) + g(x)$ on $A$.

The next result is on the uniform convergence of a sequence of functions, where the n th
term is a product of two. Formally, if $f_n(x) \rightrightarrows f(x)$ and $g_n(x) \rightrightarrows g(x)$ on $A$ and
there exists $M > 0$ such that $|f_n(x)| \le M$ and $|g_n(x)| \le M$, then

\[ f_n(x) g_n(x) \rightrightarrows f(x) g(x) \text{ on } A. \]


1.5 Examples 


Example 1: Consider $f_n(x) = nx\exp(-nx)$, $x \ge 0$. We have already seen that $f_n(x) \to f(x)$
pointwise on $x \ge 0$, where $f(x) = 0$, $x \ge 0$. Now $M_n = \sup_{x \ge 0} nx\exp(-nx)$ and
simple algebra shows that $nx\exp(-nx)$ has a unique maximum at $x = 1/n$. Thus $M_n = \exp(-1)$,
which is nonzero and independent of $n$. Thus the convergence is not uniform over
$[0, \infty)$. However, if we consider $B = [1, \infty)$, then $M_n = n\exp(-n) \to 0$ and hence the convergence is
uniform on $B$.


Example 2: Consider $f_n(x) = x^n$, $x \in [0, 1]$. Define $f(x) = 0$ if $0 \le x < 1$ and $f(x) = 1$
for $x = 1$. The convergence holds trivially for $x = 0, 1$, and we have already seen
that $f_n(x) \to f(x)$ pointwise on $x \in [0, 1]$. Now $M_n = \sup_{x \in [0, 1)} x^n$. Since $x^n$ is an increasing
function, the supremum is approached as $x \to 1$. Thus $M_n = 1$, which is different from zero and
hence the convergence is not uniform over $[0, 1]$. However, if we consider $B = [0, a]$, $a < 1$,
then $M_n = a^n \to 0$ and hence the convergence is uniform on $B$.


Example 3: Consider $f_n(x) = I(|x| \le n)$, where $I$ is an indicator function. Then it is
already shown that $\lim f_n(x) = 1 \ \forall x$ pointwise. Now $|f_n(x) - f(x)| = I(|x| > n)$. For
every $n \ge 1$, $I(|x| > n)$ is either zero or 1. Thus $M_n = 1 \not\to 0$ and hence the convergence is
not uniform.


Example 4: Consider $f_n(x) = x^2\exp(-nx)$, $x \in [0, 1]$. It is easy to observe that
$f_n(x) \to f(x)$ pointwise on $[0, 1]$, where $f(x) = 0$, $x \in [0, 1]$. Now $M_n = \sup_{x \in [0, 1]} x^2\exp(-nx)$.
Simple algebra shows that $x^2\exp(-nx)$ has a unique maximum at $x = 2/n$. Thus, for $n \ge 2$,
$M_n = \frac{4}{n^2}\exp(-2) \to 0$ and hence the convergence is uniform over $[0, 1]$.
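The criterion $M_n \to 0$ can be checked approximately by replacing the supremum with a maximum over a fine grid. The sketch below does so for $nx\exp(-nx)$ and $x^2\exp(-nx)$ on $[0, 1]$; the helper `sup_on_grid` and the grid size are assumptions of this illustration, and a grid maximum is only an approximation to the true supremum.

```python
import math

def sup_on_grid(g, a, b, points=100_001):
    """Approximate the supremum of g over [a, b] by its maximum on an even grid."""
    step = (b - a) / (points - 1)
    return max(g(a + i * step) for i in range(points))

for n in (5, 50, 500):
    # M_n for f_n(x) = n x exp(-n x): stays near exp(-1) for every n.
    m_first = sup_on_grid(lambda x: n * x * math.exp(-n * x), 0.0, 1.0)
    # M_n for f_n(x) = x^2 exp(-n x): behaves like 4 exp(-2) / n^2.
    m_second = sup_on_grid(lambda x: x * x * math.exp(-n * x), 0.0, 1.0)
    print(n, m_first, m_second)
```

The first column of values does not shrink with $n$, while the second tends to zero, matching the closed forms $\exp(-1)$ and $4\exp(-2)/n^2$.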


Large sample Inference: Module 2: 


What we provide in this module 


• Convergence concepts
• Examples
• Applications




1 Sequence & their Convergence 


It is already discussed that large sample inference is concerned with performance evaluation
when the sample size becomes large. Naturally, large sample methods require the concepts of
mathematical and stochastic convergence. Thus the study of asymptotic properties in statistics
depends heavily on results and concepts from real analysis and calculus. So we start with an
in-depth review of the basic mathematical tools, starting from sequences and series.


1.1 Sequence of real numbers 


A sequence is a function from $N$, the set of positive integers, to the real line $R$. Thus a
sequence is a set of real numbers $\{a_n, n \ge 1\}$ given by a map $a: N \to R$. That is, for every
$n \in N$, $a_n \in R$. Naturally, the domain in this case (i.e. $N$) is countable and the values lie in
the real line $R$. A sequence is, therefore, generated sequentially for increasing values of
$n \in N$.

As examples, one can think of $a_n = 1/n^2$, $a_n = \sin(n)$, $a_n = n$ and $a_n = (-1)^n$.
Consider $a_n = 1/n^2$; then we have the values $a_1 = 1$, $a_2 = 1/4$ and so on. Thus we find that
the value of $a_n$ decreases with $n$. As another example, if we consider $a_n = n$, we find that
the value increases with $n$. However, for $a_n = (-1)^n$, each term is either $+1$
or $-1$, i.e. the sequence oscillates between $-1$ and $+1$. Thus we find that sequences may be increasing or
decreasing or bounded.


1.1.1 Monotonic sequences 


A given sequence $\{a_n, n \ge 1\}$ is monotonically increasing if $a_{n+1} \ge a_n$ for every $n$. Similarly, a
sequence is monotonically decreasing if $a_{n+1} \le a_n$ for every $n$. In either case the sequence is
called monotonic. It is easy to observe that $n$, $\sqrt{n}$, $e^n$ are examples of increasing sequences.
On the other hand, $1/n^2$ and $1/(\sqrt{n} + 1)$ are examples of decreasing sequences.


Figure 1: Increasing & Decreasing Sequences

Figure 2: Oscillating Sequences


1.1.2 Bounded sequences 


A sequence $\{a_n, n \ge 1\}$ is said to be bounded if there exist $k_1$ and $k_2$ such that

\[ k_1 \le a_n \le k_2 \]

is satisfied for every $n$, where $k_1$ and $k_2$ are independent of $n$. Since

\[ k_1 \le a_n \le k_2 \Rightarrow -|k_1| - |k_2| \le a_n \le |k_1| + |k_2|, \]

we get that $|a_n| \le k$ for some $k$. Thus a bounded sequence satisfies $|a_n| \le k$ for some $k$.
For example, we have $(-1)^n$ and $\sin(n)$.


1.1.3 Subsequence 


Suppose $\{a_n, n \ge 1\}$ is a given sequence and $k(n)$ is an increasing sequence of positive integers.
That is, $k(n + 1) > k(n) \ \forall n$. Then the new sequence $\{a_{k(n)}\}$ is called a subsequence of
$\{a_n, n \ge 1\}$. For example, consider a sequence $a_n = 1/n$ and $k(n) = 2n$. Then the corresponding
subsequence is $1/(2n)$, $n \ge 1$. Now consider a sequence $a_n = (-1)^n$ and $k(n) = 2n$. Then the
corresponding subsequence is a constant sequence containing only 1.


1.2 Convergence of a sequence 


Before going to the formal definition, let us consider some illustrative examples. Consider
two sequences $1/n^2$, $n \ge 1$ and $\sin(n)/n$, $n \ge 1$. A plot will be helpful to study the behaviour
of these sequences separately as we increase $n$.

It is easy to observe that the curve of $1/n^2$ decreases with increase in $n$. Finally, after a large
value of $n$, the curve of $1/n^2$ becomes very close to the zero mark. However, for the second
sequence, the curve takes the first few $n$ to become stable. But as earlier, the curve stays
within a small distance of the zero value for large $n$. Thus, in each case the value of the sequence
becomes very close to a postulated value (0, in these cases) as we exceed a large value of $n$.
This gives the basis to define the convergence of a sequence.


Figure 3: Convergence (panels: $1/n^2$ and $\sin(n)/n$, plotted against $n$)


1.2.1 Convergent sequence 


A sequence $\{a_n, n \ge 1\}$ is said to be convergent if there exists $a \in R$ such that for every $\epsilon > 0$,
there exists a positive integer $\nu = \nu(\epsilon)$ for which $|a_n - a| < \epsilon$ is satisfied for every $n \ge \nu$. If
such an $a$ exists, we write

\[ \lim_{n \to \infty} a_n = a \]

and define $a$ as the limit of the sequence. Let us consider a few examples.

Example 1: Consider $a_n = 1 + \exp(-n)$, which is a decreasing sequence. Then by
inspection, we identify $a$ as 1. Thus if a feasible $\nu(\epsilon)$ exists for every $\epsilon > 0$, then $\lim a_n = 1$.
For $\epsilon \ge 1$, $|1 + \exp(-n) - 1| < \epsilon$ is satisfied for any $n \ge 1$. However, if $0 < \epsilon < 1$, then
$|1 + \exp(-n) - 1| < \epsilon$ gives $n \ge [-\log(\epsilon)] + 1$. Combining, we set $\nu(\epsilon) = [-\log(\epsilon)] + 1$. Since
for every $\epsilon > 0$ we get a feasible choice, $\lim a_n = 1$.

Example 2: Consider $a_n = 1/n$, which is a decreasing sequence. Then by inspection, we
identify $a$ as 0. Thus if a feasible $\nu(\epsilon)$ exists for every $\epsilon > 0$, then $\lim a_n = 0$. If $\epsilon > 1$ then
$|1/n - 0| < \epsilon$ for any $n \ge 1$. However, if $0 < \epsilon \le 1$, then $|1/n - 0| < \epsilon$ gives $n \ge [1/\epsilon] + 1$.
Thus we set $\nu(\epsilon) = [1/\epsilon] + 1$. Since for every $\epsilon > 0$ we get a feasible choice, $\lim a_n = 0$.


Figure 4: Divergence (panels: $n^2$ and $-\log(n)$, plotted against $n$)


Example 3: Consider $a_n = \sqrt{n + 1} - \sqrt{n}$. Rationalizing, we get $a_n = (\sqrt{n + 1} + \sqrt{n})^{-1}$,
which is a decreasing sequence. Then by inspection, we identify $a$ as 0. Now $(\sqrt{n + 1} + \sqrt{n})^{-1} < 1/\sqrt{n}$
and hence $|a_n - 0| < \epsilon$ will be satisfied if $|1/\sqrt{n} - 0| < \epsilon$ is satisfied. The last
inequality gives $n \ge [1/\epsilon^2] + 1$ for $\epsilon \le 1$. For $\epsilon > 1$, the inequality is satisfied for any $n \ge 1$.
Thus we can set $\nu(\epsilon) = [1/\epsilon^2] + 1$. Since for every $\epsilon > 0$ we get a feasible choice, $\lim a_n = 0$.
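The choices of $\nu(\epsilon)$ in Examples 1 to 3 can be checked numerically over a finite stretch of indices. The sketch below is a sanity check rather than a proof; the dictionary `EXAMPLES`, the helper `check` and the tested values of $\epsilon$ are assumptions of this illustration.

```python
import math

# Each entry: (sequence a_n, claimed limit a, claimed nu(eps)).
EXAMPLES = {
    "1 + exp(-n)": (lambda n: 1 + math.exp(-n), 1.0,
                    lambda eps: math.floor(-math.log(eps)) + 1),
    "1/n": (lambda n: 1 / n, 0.0,
            lambda eps: math.floor(1 / eps) + 1),
    "sqrt(n+1) - sqrt(n)": (lambda n: math.sqrt(n + 1) - math.sqrt(n), 0.0,
                            lambda eps: math.floor(1 / eps ** 2) + 1),
}

def check(seq, limit, nu, eps, horizon=1000):
    """True if |a_n - a| < eps for the first `horizon` indices n >= nu(eps)."""
    start = max(1, nu(eps))
    return all(abs(seq(n) - limit) < eps for n in range(start, start + horizon))

for name, (seq, limit, nu) in EXAMPLES.items():
    for eps in (0.3, 0.07, 0.013):
        print(name, eps, nu(eps), check(seq, limit, nu, eps))
```

A finite check of this kind cannot replace the inequality argument, but it quickly exposes a wrong choice of $\nu(\epsilon)$.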


1.2.2 Divergent sequence 


By analogy, if a sequence is not convergent, it is divergent. Let us consider some
examples. Consider two sequences $n^2$, $n \ge 1$ and $-\log(n)$, $n \ge 1$. Plots will be helpful to
study the behaviour of these sequences separately as we increase $n$.

It is easy to observe that $a_n = n^2$ increases enormously whereas $a_n = -\log(n)$ decreases
as we increase $n$. Naturally, we cannot observe any finite number to which $a_n$ converges. In
fact, after a large value of $n$, $a_n$ becomes arbitrarily large or small. This gives the idea of the
converse of convergence, that is, divergence.

Thus formally, a sequence $\{a_n, n \ge 1\}$ diverges to $+\infty$ if for any $M (> 0)$ there exists $\nu(M)$
such that

\[ a_n > M \quad \forall \ n \ge \nu(M) \]

is satisfied. On the other hand, $\{a_n, n \ge 1\}$ diverges to $-\infty$ if $\{-a_n, n \ge 1\}$ diverges to $+\infty$.

Let us illustrate the idea through an example. Consider $a_n = n^2$. Then $a_n > M$ holds whenever
$n \ge [\sqrt{M}] + 1$. Thus setting $\nu(M) = [\sqrt{M}] + 1$, we get that $n^2$ diverges to $+\infty$. Note that
$(-1)^n$ is also an example of a divergent sequence as it does not converge to a fixed value
but oscillates between $-1$ and $+1$.


2 Few results on convergence 


Now we shall provide a few convergence results without proof, which are useful in further
development.

1. A convergent sequence is always bounded.

2. The limit of a convergent sequence is unique.

3. A sequence is convergent iff every subsequence from it is convergent.

4. A monotone sequence is convergent iff it is bounded.

5. If $a_n \ge b \ \forall \ n \ge 1$ for fixed $b$, then $\lim a_n \ge b$ provided the limit exists.

6. If $\lim a_n = a$ and $\lim b_n = b$ then
(i) $\alpha a_n + b_n \to \alpha a + b$
(ii) $a_n b_n \to ab$
(iii) $f(a_n) \to f(a)$ for a continuous function $f$.

For detailed proofs, we refer the reader to the book by Rudin (1976). Next we shall give two
useful approximations. The first one is Stirling's approximation, given by

\[ n! \approx \sqrt{2\pi}\, e^{-n} n^{n + \frac{1}{2}}. \]

The next one is a limit result on exponential approximation, given by

\[ \lim_{n \to \infty} \left( 1 + \frac{c}{n} \right)^n = e^c \]

for fixed $c \in R$.
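Both approximations are easy to examine numerically. The following sketch compares $n!$ with Stirling's formula and $(1 + c/n)^n$ with $e^c$; the function name `stirling` and the particular values of $n$ and $c$ are choices made for this illustration only.

```python
import math

def stirling(n):
    """Stirling's approximation sqrt(2*pi) * exp(-n) * n**(n + 1/2)."""
    return math.sqrt(2 * math.pi) * math.exp(-n) * n ** (n + 0.5)

# The ratio n! / stirling(n) should approach 1 as n grows.
for n in (5, 20, 100):
    print(n, math.factorial(n) / stirling(n))

# (1 + c/n)**n should approach exp(c) for fixed c.
c = 2.0
for n in (10, 1000, 1_000_000):
    print(n, (1 + c / n) ** n, math.exp(c))
```

The ratio for Stirling's formula exceeds 1 only slightly even for small $n$, which is why the approximation is so widely used in large sample calculations.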


3 Applications in large sample theory 


Now we shall discuss an application of the above results in large sample theory. Suppose $X$
has a Poisson distribution with mean 5 and we are interested in $P(X = n)$ for a large $n$.
Observe that for $n > 5$, $5^n/n! \le (5^5/5!)(5/n)$. Thus $\lim 5^n/n! = 0$ and hence $\lim P(X = n) = 0$.
Thus for large $n$ the probability is negligibly small. Let us compute $P(X = n)$ for different
values of $n$.

n        | 20            | 30             | 40
P(X = n) | 2.64 × 10^-7  | 2.36 × 10^-14  | 7.5 × 10^-23

From the above table it is easy to observe that the limiting value of zero is practically reached for $n$
exceeding 30.
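The tabulated probabilities can be reproduced with a few lines of code. The sketch below evaluates the Poisson pmf on the log scale to avoid overflow in $5^n$ and $n!$; the function name `poisson_pmf` is an assumption of this illustration.

```python
import math

def poisson_pmf(n, mean):
    """P(X = n) for X ~ Poisson(mean), computed on the log scale."""
    # log P(X = n) = -mean + n*log(mean) - log(n!), with log(n!) = lgamma(n + 1).
    return math.exp(-mean + n * math.log(mean) - math.lgamma(n + 1))

for n in (20, 30, 40):
    print(n, poisson_pmf(n, 5.0))
```

Working with logarithms keeps the computation stable even for values of $n$ far larger than those in the table.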


Large sample Inference: Module 3: 


What we provide in this module 


• Convergence of Series
• Applications in large sample theory
• Bounds of sequences




1 Series & their Convergence 


Series is a very widely applied concept in statistics. For example, the Poisson, Geometric and
Log series distributions are well known in statistical literature. It is interesting to observe
that each of them is derived from an infinite series. For example, from the identity

\[ -\ln(1 - \theta) = \sum_{x=1}^{\infty} \frac{\theta^x}{x}, \quad 0 < \theta < 1, \]

we get the pmf of the log-series distribution

\[ P(X = x) = -\frac{\theta^x}{x \ln(1 - \theta)}, \quad x = 1, 2, \ldots \]

1.1 Series 


For a detailed development, consider a sequence of real numbers $\{a_n\}$. Define another
sequence as $s_n = \sum_{k=1}^{n} a_k$, $n \ge 1$. The pair of sequences $\{a_n\}$ and $\{s_n\}$ define an infinite
series. A series is denoted by $\sum_{n=1}^{\infty} a_n$, where $a_n$ is called the n th term of the series and $s_n$
is the n th partial sum of the series.


1.1.1 Convergence of series 


$\sum_{n=1}^{\infty} a_n$ is ordinarily convergent or divergent according as the sequence $\{s_n\}$ is convergent or
divergent. Suppose $\{s_n\}$ is such that $\lim s_n = s$, where we allow the possibility of $s = \pm\infty$.
If $s$ is finite, we say that the series converges ordinarily and then $\sum_{n=1}^{\infty} a_n = s$. However, if
$s = \pm\infty$, the series diverges and we say that the series diverges to $+\infty$ or $-\infty$.

We further have two other modes of convergence for a series containing both positive
and negative terms. $\sum_{n=1}^{\infty} a_n$ is absolutely convergent if the series containing the absolute
values of $a_n$ is convergent. That is, $\sum_{n=1}^{\infty} a_n$ is absolutely convergent if $\sum_{n=1}^{\infty} |a_n|$ is convergent.
Again, $\sum_{n=1}^{\infty} a_n$ is conditionally convergent if $\sum_{n=1}^{\infty} a_n$ is convergent but $\sum_{n=1}^{\infty} |a_n|$ is
divergent. Note that absolute convergence $\Rightarrow$ ordinary convergence but the converse is not
always true.


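Example 1: Consider the series $\sum_{n=1}^{\infty} n^{-2}$. Then the nth partial sum is simply

$$s_n = 1 + \frac{1}{2^2} + \cdots + \frac{1}{n^2}.$$

Note that $\{s_n\}$ is an increasing sequence. Thus $2^{n+1} - 1 > n$ implies $s_n < s_{2^{n+1}-1}$. However, grouping the terms into blocks running from $2^k$ to $2^{k+1} - 1$, each block sums to at most $2^k \cdot 2^{-2k} = 2^{-k}$, so that $s_{2^{n+1}-1} < 2 - 2^{-n} < 2$ for all $n \ge 1$. Thus $s_n < 2$ for all $n \ge 1$ and also $s_n \ge 1$ for all $n \ge 1$. Hence $\{s_n\}$ is monotone and bounded. Thus by the monotone convergence theorem (MCT), $\{s_n\}$ and hence the series converges.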
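The bound used in Example 1 can be checked numerically. The following is only a small sketch: the helper names `partial_sum` and `block_bound` are illustrative and not part of the original notes, and a finite check is of course no substitute for the proof.

```python
def partial_sum(n):
    """Return s_n = sum_{k=1}^{n} 1/k**2."""
    return sum(1.0 / (k * k) for k in range(1, n + 1))


def block_bound(n):
    """Return the upper bound 2 - 2**(-n) claimed for s_{2**(n+1) - 1}."""
    return 2.0 - 2.0 ** (-n)


# Compare s_{2^(n+1)-1} with the bound for a few values of n.
for n in range(1, 11):
    m = 2 ** (n + 1) - 1
    print(n, m, round(partial_sum(m), 6), round(block_bound(n), 6))
```

The printed partial sums increase with n but stay below the bound, which is what monotonicity and boundedness require.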


Example 2: Consider the series $\sum_{n=1}^{\infty} (-1)^n$. Then the nth partial sum is either 0 for even n or -1 for odd n. Thus

$$s_n = \frac{-1 + (-1)^n}{2}.$$

Naturally $\{s_n\}$ is oscillating and hence does not converge. Then by definition, the series diverges. The series of absolute values also diverges.


Example 3: Consider the geometric series $\sum_{n=1}^{\infty} r^{n-1}$. Then the nth partial sum $s_n$ is n for $r = 1$ and $(1 - r^n)/(1 - r)$ for $r \ne 1$. If $|r| < 1$, $s_n \to (1 - r)^{-1}$. However, if $r > 1$, $r^n \to \infty$ and hence $s_n \to \infty$. If $r \le -1$, $s_n$ oscillates. Combining, we get that the geometric series converges for $|r| < 1$ and diverges otherwise.

Example 4: The p series $\sum_{n=1}^{\infty} n^{-p}$ is known to converge iff $p > 1$. Thus the harmonic series $\sum_{n=1}^{\infty} n^{-1}$ diverges.

Example 5: Consider the series $1 - \frac{1}{2} + \frac{1}{3} - \cdots$. It is well known that the above series converges to ln(2). However, the corresponding series of absolute values is nothing but the harmonic series $\sum_{n=1}^{\infty} \frac{1}{n}$, which diverges. Hence the series is conditionally convergent.


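Example 6: Now we shall give an interesting example on rearranged series. In fact, the terms of a conditionally convergent series may be suitably rearranged to converge to a different value or to diverge. Consider the series $s = 1 - \frac{1}{2} + \frac{1}{3} - \frac{1}{4} + \cdots$, where s = ln(2). The series is clearly conditionally convergent. Writing each negative term $-\frac{1}{2k}$ as $\frac{1}{2k} - \frac{1}{k}$, consider the series:

$$1 + \left(\tfrac{1}{2} - 1\right) + \tfrac{1}{3} + \left(\tfrac{1}{4} - \tfrac{1}{2}\right) + \tfrac{1}{5} + \left(\tfrac{1}{6} - \tfrac{1}{3}\right) + \cdots$$

Now separating the positive and negative terms we get, formally, from the above

$$\left(1 + \tfrac{1}{2} + \tfrac{1}{3} + \cdots\right) - \left(1 + \tfrac{1}{2} + \tfrac{1}{3} + \cdots\right),$$

which suggests the value 0. Now consider the series s/2 with zeros inserted:

$$\frac{s}{2} = 0 + \tfrac{1}{2} + 0 - \tfrac{1}{4} + 0 + \tfrac{1}{6} + 0 - \tfrac{1}{8} + \cdots$$

Now adding term by term with s, we get

$$s + \frac{s}{2} = (1 + 0) + \left(-\tfrac{1}{2} + \tfrac{1}{2}\right) + \left(\tfrac{1}{3} + 0\right) + \left(-\tfrac{1}{4} - \tfrac{1}{4}\right) + \left(\tfrac{1}{5} + 0\right) + \cdots$$

Thus the RHS is nothing but $1 + \tfrac{1}{3} - \tfrac{1}{2} + \tfrac{1}{5} + \tfrac{1}{7} - \tfrac{1}{4} + \cdots$, which is a rearrangement of the original series. Then we get three possible values, corresponding to different arrangements of the terms of a conditionally convergent series, as ln(2), 0 and 3 ln(2)/2. For a detailed discussion on this issue, we refer the reader to Methods of Real Analysis (1964) by R.R. Goldberg.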
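The effect of rearrangement in Example 6 can be seen numerically. The sketch below compares the alternating harmonic series in its natural order with the arrangement "two positive terms, then one negative term"; the function names are illustrative assumptions, and the finite sums only suggest the limits ln(2) and 3 ln(2)/2.

```python
import math


def alternating_sum(n_terms):
    """Partial sum of 1 - 1/2 + 1/3 - ... using n_terms terms."""
    return sum((-1) ** (k + 1) / k for k in range(1, n_terms + 1))


def rearranged_sum(n_blocks):
    """Partial sum of 1 + 1/3 - 1/2 + 1/5 + 1/7 - 1/4 + ...

    Block j (j = 1, 2, ...) contributes 1/(4j-3) + 1/(4j-1) - 1/(2j).
    """
    return sum(1 / (4 * j - 3) + 1 / (4 * j - 1) - 1 / (2 * j)
               for j in range(1, n_blocks + 1))


print(alternating_sum(100000), math.log(2))
print(rearranged_sum(100000), 1.5 * math.log(2))
```

Both sums use exactly the same set of terms; only the order differs.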


2 Applications in probability theory 


The above observation regarding rearranged series forms a basis to define the existence of moments.

Suppose X is a random variable with the probability mass function $P(X = x_i) = p_i$, $i \ge 1$. Then for some function g, $E\,g(X)$ is said to exist iff the series $\sum_i g(x_i) p_i$ is absolutely convergent. Now we shall give an example to show the usefulness of the above requirement. Suppose $x_i = (-1)^{i+1} 2^i / i$, $i \ge 1$, and $p_i = 2^{-i}$. Then the series $\sum_i x_i p_i = \sum_i (-1)^{i+1}/i$ is ordinarily convergent to ln(2). However, the series is not absolutely convergent, so its value depends on the order in which the terms are summed, and hence E(X) does not exist though $\sum_i x_i p_i$ is convergent.


Another application of convergence of series is to check the existence of moments of distributions whose support is the whole set of positive integers. Suppose $P(X = n) = \frac{c}{n(n+1)}$, $n = 1, 2, \ldots$, for some $c > 0$. Then c is such that $\sum_{n=1}^{\infty} P(X = n) = 1$. Now consider the nth partial sum of the above, which is $s_n = \sum_{k=1}^{n} \frac{c}{k(k+1)}$. After an adjustment using partial fractions, $\frac{1}{k(k+1)} = \frac{1}{k} - \frac{1}{k+1}$, we get $s_n = c\left(1 - \frac{1}{n+1}\right)$. Now $s_n \to c$ and hence we get $\sum_{n=1}^{\infty} P(X = n) = 1$ only when $c = 1$. However, we find that the series $\sum_{n=1}^{\infty} n P(X = n) = \sum_{n=1}^{\infty} \frac{1}{n+1}$ diverges and hence E(X) does not exist.
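A short numerical sketch of the second application, assuming the pmf $P(X = n) = c/(n(n+1))$ with c = 1. The names `pmf`, `total_mass` and `truncated_mean` are hypothetical helpers introduced only for illustration.

```python
def pmf(n, c=1.0):
    """P(X = n) = c / (n (n + 1)) for n = 1, 2, ..."""
    return c / (n * (n + 1))


def total_mass(N):
    """Partial sum s_N of the probabilities; equals 1 - 1/(N + 1) when c = 1."""
    return sum(pmf(n) for n in range(1, N + 1))


def truncated_mean(N):
    """Partial sum of n * P(X = n), i.e. sum of 1/(n + 1) for n <= N."""
    return sum(n * pmf(n) for n in range(1, N + 1))


for N in (10, 100, 1000, 10000):
    print(N, round(total_mass(N), 6), round(truncated_mean(N), 4))
```

The total mass settles near 1 while the truncated mean keeps growing roughly like log N, which is the numerical face of the non-existence of E(X).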


3 Supremum & Infimum of a sequence 


We have already seen that sequences can diverge. However, the asymptotic behaviour of a divergent (mostly oscillatory) sequence can be described by considering the behaviour of the bounds of the sequence. For $\{a_n\}$, l (u) is a lower (upper) bound if $a_n \ge l$ ($a_n \le u$) for all $n \ge 1$. From the properties of real numbers, if a sequence admits a lower (upper) bound, it also has a greatest lower (least upper) bound. The least upper bound and the greatest lower bound are defined, respectively, by $u_0 = \sup_{n \ge 1} a_n$ and $l_0 = \inf_{n \ge 1} a_n$. The limiting behaviour of any sequence can be studied from the behaviour of these bounds for increasing n.

Note that $\sup_{k \ge n} a_k$ is a decreasing sequence in n and hence we get the limit as $\inf_{n \ge 1} \sup_{k \ge n} a_k$. The above quantity is known as the limit superior of $a_n$ and is denoted by either $\limsup a_n$ or $\overline{\lim}\, a_n$. Again $\inf_{k \ge n} a_k$ is an increasing sequence in n and we get the limit as $\sup_{n \ge 1} \inf_{k \ge n} a_k$. The above quantity is known as the limit inferior of $a_n$ and is denoted by either $\liminf a_n$ or $\underline{\lim}\, a_n$. Unlike limits, these two quantities always exist (possibly as $\pm\infty$) and can be used to judge the behaviour of a sequence in the limit.


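Let us enumerate some results on these without proof.

1. $\{a_n\}$, $n \ge 1$, is convergent iff $\overline{\lim}\, a_n$ and $\underline{\lim}\, a_n$ are both finite and equal.

2. For any $\{a_n\}$, $n \ge 1$, $\overline{\lim}\, a_n \ge \underline{\lim}\, a_n$.

3. If $\underline{\lim}\, a_n$ is finite then $\underline{\lim}\, a_n = -\overline{\lim}(-a_n)$.

4. If $\{a_n\}$ and $\{b_n\}$ are two sequences with finite $\overline{\lim}$ and $\underline{\lim}$, then
(i) $\overline{\lim}(a_n + b_n) \le \overline{\lim}\, a_n + \overline{\lim}\, b_n$
and
(ii) $\underline{\lim}(a_n + b_n) \ge \underline{\lim}\, a_n + \underline{\lim}\, b_n$.

5. If $\{a_n\}$ and $\{b_n\}$ are two positive sequences with finite $\overline{\lim}$, then
$\overline{\lim}(a_n b_n) \le \overline{\lim}\, a_n \; \overline{\lim}\, b_n$
and
$\underline{\lim}(a_n b_n) \ge \underline{\lim}\, a_n \; \underline{\lim}\, b_n$.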


3.1 Examples 


Example 1: Consider $a_n = e^{-n}$, $n \ge 1$. The sequence is strictly positive and monotonically decreasing in n. Now $\sup_{k \ge n} a_k = \sup_{k \ge n} e^{-k} = e^{-n}$. Then $\overline{\lim}\, a_n = \inf_{n \ge 1} e^{-n} = 0$. Again $\inf_{k \ge n} a_k = \inf_{k \ge n} e^{-k} = 0$. Thus $\underline{\lim}\, a_n = \sup_{n \ge 1} 0 = 0$. Thus $\overline{\lim}\, a_n = \underline{\lim}\, a_n = 0$ and hence the limit exists and $\lim a_n = 0$.

Example 2: Consider $a_n = 1 + (-1)^n$, $n \ge 1$. The sequence is non-negative and oscillates between 0 and 2. Now $\sup_{k \ge n} a_k = 1 + \sup_{k \ge n} (-1)^k = 2$. Then $\overline{\lim}\, a_n = \inf_{n \ge 1} 2 = 2$. Again $\inf_{k \ge n} a_k = 1 + \inf_{k \ge n} (-1)^k = 0$. Thus $\underline{\lim}\, a_n = 0$ but $\overline{\lim}\, a_n = 2$. Since these two quantities are different, the limit does not exist.
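The tail suprema and infima used in the two examples can be approximated on a computer by truncating the tail at a finite horizon. This is only a rough sketch under that assumption (a true tail is infinite), and the helper name `tail_sup_inf` is illustrative.

```python
import math


def tail_sup_inf(a, n, horizon):
    """Return (sup, inf) of a(k) over n <= k <= horizon.

    The finite horizon is a stand-in for the infinite tail k >= n.
    """
    values = [a(k) for k in range(n, horizon + 1)]
    return max(values), min(values)


# Example 1: a_n = exp(-n); both tail bounds shrink towards 0.
print(tail_sup_inf(lambda k: math.exp(-k), 30, 200))

# Example 2: a_n = 1 + (-1)^n; the tail bounds stay at 2 and 0.
print(tail_sup_inf(lambda k: 1 + (-1) ** k, 50, 200))
```

For the oscillating sequence the two tail bounds never meet, matching the conclusion that the limit does not exist.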


Large sample Inference: Module 4

Learn More

1. Rohatgi, V.K. and Saleh, A.K. (2002). An Introduction to Probability and Statistics, Second Edition, John Wiley & Sons Inc., New York.
2. Apostol, T.M. (1974). Mathematical Analysis, Addison-Wesley Publishing Company, Inc.
3. Bartle, R.G. and Sherbert, D.R. (1972). Introduction to Real Analysis, John Wiley & Sons Inc., New York.
4. Goldberg, R.R. (1964). Methods of Real Analysis, Blaisdell Publishing Company.
5. Protter, M.H. and Morrey, C.B. (1991). A First Course in Real Analysis, Springer-Verlag.
6. Ross, K.A. (1980). Elementary Real Analysis, Springer.
7. Rudin, W. (1976). Principles of Mathematical Analysis, Third Edition, McGraw Hill Inc.
8. Bandyopadhyay, S. (2011). Mathematical Analysis: Problems and Solutions, Academic Publishers, Kolkata.
9. Malik, S.C. and Arora, S. (2000). Mathematical Analysis, Second Edition, New Age International (P) Limited, Kolkata.
10. Mapa, S.K. (2004). Introduction to Real Analysis, Fourth Edition, Sarat Book Distributors, Kolkata.


Large sample Inference: Module 5

What we provide in this module

• Further examples of series convergence
• Big O and Small o notations
• Gauss's test and examples
• Sequence of functions


1 Further Examples 


Example 1: Consider the series $\sum_{n=1}^{\infty} (n+1)^{-n}$. Then $a_n = (n+1)^{-n}$ and hence $(a_n)^{1/n} = (n+1)^{-1}$. Thus $\lim (a_n)^{1/n} = 0 < 1$ and hence by the root test, the series converges.

Example 2: Consider the series $\sum_{n=1}^{\infty} n^n/n!$. Then $a_n = n^n/n!$ and hence $a_{n+1}/a_n = (1 + 1/n)^n$. It is well known that $\lim (1 + 1/n)^n = e > 1$. Hence by the ratio test the series diverges.

Example 3: Consider the series $\frac{1}{3} + 3 + \frac{1}{3^2} + 3^2 + \cdots$. Then $a_{2n-1} = 3^{-n}$ and $a_{2n} = 3^n$, where $n = 1, 2, \ldots$, so that $\lim (a_{2n})^{1/(2n)} = \sqrt{3}$ and $\lim (a_{2n-1})^{1/(2n-1)} = 1/\sqrt{3}$. Hence $\overline{\lim}\, (a_n)^{1/n} = \sqrt{3} > 1$. Thus by the root test the series is divergent.

Example 4: Consider the series $\sum_{n=1}^{\infty} (2 + \log n)^{-1}$. Naturally $a_n = 1/(2 + \log n) = 1/\log(n e^2)$ and it is a decreasing quantity. By the Cauchy condensation test, for an integer $b > 1$ the series $\sum a_n$ converges iff $\sum b^n a_{b^n}$ converges. Now $b^n a_{b^n} = b^n/\log(e^2 b^n) = b^n/(2 + n \log b)$. Since $b > 1$, $b^n > 1$ and hence we get $b^n a_{b^n} > 1/(2 + n \log b)$. Now b is a constant and $\sum_n 1/(2 + n \log b)$ is known to be divergent. Hence the original series diverges by the Cauchy condensation test.

Example 5: Consider the series $\sum_{n=2}^{\infty} (n \log n)^{-p}$. Naturally $a_n = 1/(n \log n)^p$. Now $b^n a_{b^n} = b^n/(b^n \log b^n)^p = b^{n(1-p)}/(n \log b)^p$. Now for $p > 1$, $b^{n(1-p)} < 1$ and hence $b^n a_{b^n} < 1/(n \log b)^p$. Since b is a constant and $\sum_n n^{-p}$ converges for $p > 1$, we get by the comparison test that the series converges for $p > 1$. However, for $p = 1$, $b^n a_{b^n} = 1/(n \log b)$, whose series is divergent. Now for $p < 1$, $b^{n(1-p)} > 1$ and hence $b^n a_{b^n} > 1/(n \log b)^p$. Since b is a constant and $\sum_n n^{-p}$ diverges for $p < 1$, we get by the comparison test that the series diverges for $p < 1$.

Example 6: [p series revisited] Consider the series $\sum_{n=1}^{\infty} n^{-p}$. For the p series, we observe that $a_n = 1/n^p$ is decreasing in n. Now $b^n a_{b^n} = b^n/(b^n)^p = b^{n(1-p)}$. Now for $p > 1$, $b^{1-p} < 1$ and hence $\sum_n b^n a_{b^n}$ becomes a geometric series with common ratio less than unity. Thus the series converges for $p > 1$. For $p = 1$, $b^n a_{b^n} = 1$, which is a constant, and therefore the corresponding series diverges. Now for $p < 1$, $b^{1-p} > 1$ and hence, from the properties of geometric series, the series diverges.
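The quantities behind the ratio and root tests in Examples 2 and 3 can be evaluated numerically. The sketch below works with logarithms to avoid overflow; the function names are illustrative assumptions, and the indexing of Example 3 is taken as $a_{2m-1} = 3^{-m}$, $a_{2m} = 3^{m}$.

```python
import math


def log_a_ex2(n):
    """log of a_n = n**n / n!, using lgamma(n + 1) = log(n!)."""
    return n * math.log(n) - math.lgamma(n + 1)


def ratio_ex2(n):
    """a_{n+1} / a_n for Example 2; equals (1 + 1/n)**n."""
    return math.exp(log_a_ex2(n + 1) - log_a_ex2(n))


def root_ex3(n):
    """(a_n)**(1/n) for the series 1/3 + 3 + 1/9 + 9 + ... (n = 1, 2, ...)."""
    m = (n + 1) // 2
    log_a = m * math.log(3) if n % 2 == 0 else -m * math.log(3)
    return math.exp(log_a / n)


print(ratio_ex2(1000), math.e)       # ratio approaches e > 1
print(root_ex3(2000), root_ex3(2001))  # sqrt(3) along even n, about 1/sqrt(3) along odd n
```

A limiting ratio above 1, or a limit superior of the nth roots above 1, signals divergence.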


2 Growth orders 


2.1 Big O notations 


The symbol O is due to P. Bachmann and was popularized by E. Landau. The letter O is used to indicate the growth rate of a function, which is commonly known as its order. The notation is useful for comparing the rates at which functions grow or decrease. Suppose $\{a_n\}$ and $\{b_n\}$ are two sequences; then $a_n = O(b_n)$ as $n \to \infty$ iff $a_n/b_n$ is uniformly bounded for $n \ge n_0$. That is, there exist a constant M and $n_0$ such that $|a_n/b_n| \le M$ whenever $n \ge n_0$. In particular, $a_n = O(1)$ implies that $a_n$ is itself bounded for large n. Note that $a_n = O(b_n) \Rightarrow a_n/b_n = O(1)$. If, in addition, $b_n = O(a_n)$, then $a_n$ and $b_n$ are said to grow or shrink at the same rate.

For example, consider the polynomial $a_0 + a_1 n + a_2 n^2 + a_3 n^3$. Then we can write the polynomial as $O(n^3)$ whatever be the coefficients. However, the polynomial $a_1 n^{-1} + a_2 n^{-2} + a_3 n^{-3}$ is represented by $O(n^{-1})$. If $a_n$ is a finite sum/difference of a number of functions, then the fastest growing quantity gives the order. Consider $a_n = n^2 + n + \log n$; then naturally the fastest growing quantity is $n^2$. Thus we can write $a_n = O(n^2)$ as $n \to \infty$. As a further example, consider the polynomial $a_n = 4 + 6n^3 - 2n^5$. Naturally the highest power is 5 and hence conventionally $a_n = O(n^5)$. However, if we follow the definition, we need the auxiliary sequence $b_n$. Note that $|a_n| \le 4 + 6n^3 + 2n^5 \le 4n^5 + 6n^5 + 2n^5 = 12n^5$ for $n \ge 1$. Thus we get $M = 12$, $n_0 = 1$ and $b_n = n^5$. On the other hand, if $a_n$ is a product of several factors, the constant factor is omitted from the order relation.

Big O notation is useful in computer science to compare efficiency. Suppose a program of size n (in some sense) takes the time $a_n = n^2 + 2n$ to give the output. Another size n program for the same purpose takes the time $b_n = n^3 - n^2$ to give the output. Naturally the program taking less time is more efficient. Now the latter program has the leading term $n^3$, which grows at a faster rate than the leading term of the former as n increases. Thus big O notation can be used to indicate time efficiency. Since $a_n = O(n^2)$ whereas $b_n = O(n^3)$, the first program is more efficient for large n.
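The definition of $a_n = O(b_n)$ asks for a pair of witnesses (M, $n_0$). A finite check of a proposed pair can be sketched as follows. The polynomial $4 + 6n^3 - 2n^5$ is used for illustration, with the middle exponent taken as an assumption; the helper name `witnesses_big_o` is hypothetical, and a finite range can only refute a pair, never prove it.

```python
def witnesses_big_o(a, b, M, n0, n_max):
    """Check |a(n)| <= M * |b(n)| for every n0 <= n <= n_max (finite check only)."""
    return all(abs(a(n)) <= M * abs(b(n)) for n in range(n0, n_max + 1))


def a(n):
    # Illustrative polynomial; the middle exponent is an assumption.
    return 4 + 6 * n ** 3 - 2 * n ** 5


def b(n):
    return n ** 5


print(witnesses_big_o(a, b, M=12, n0=1, n_max=1000))  # bound 12 n^5 holds
print(witnesses_big_o(a, b, M=1, n0=1, n_max=1000))   # M = 1 is too small
```

Integer arithmetic is used throughout, so the comparison is exact over the checked range.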


2.2 Small o notations 


The small o notation is used to capture the growth comparison of two quantities. If $\{a_n\}$ and $\{b_n\}$ are two sequences, then $a_n = o(b_n)$ as $n \to \infty$ if for every positive constant $\epsilon$ there exists $n_0 = n_0(\epsilon)$ such that $|a_n/b_n| < \epsilon$ for every $n \ge n_0$. This simply means that $a_n = o(b_n)$ iff $\lim a_n/b_n = 0$. Naturally small o is different from big O in the sense that, for big O, the inequality has to be true for some M, whereas for small o the inequality is to be satisfied for every positive $\epsilon$. Thus $a_n = o(b_n) \Rightarrow a_n = O(b_n)$ but the converse is not true.


2.3 Tests related to order of growth: Gauss’s test 


Suppose $\sum_n a_n$ is a series with positive terms such that

$$\frac{a_n}{a_{n+1}} = 1 + \frac{\alpha}{n} + \frac{\beta_n}{n^p},$$

where $p > 1$ and $\{\beta_n\}$ is a bounded sequence. Then the series converges if $\alpha > 1$ and diverges if $\alpha \le 1$. Alternatively, suppose $\sum_n a_n$ is a series with positive terms satisfying

$$\frac{a_n}{a_{n+1}} = 1 + \frac{\alpha}{n} + O\left(\frac{1}{n^p}\right)$$

with $p > 1$. Then the series converges if $\alpha > 1$ and diverges if $\alpha \le 1$.

As an example, consider the series

$$\sum_{n \ge 2} \frac{2^2 4^2 \cdots (2n-2)^2}{3^2 5^2 \cdots (2n-1)^2}.$$

Then it is easy to observe that $\lim a_{n+1}/a_n = 1$. Thus the ratio test fails. Now $a_n/a_{n+1} = (2n+1)^2/(2n)^2 = 1 + 1/n + 1/(4n^2) = 1 + 1/n + O(n^{-2})$. Here $\alpha = 1$ and hence by Gauss's test the series diverges.

As another example consider the p series $\sum_n n^{-p}$. Then we get $a_n/a_{n+1} = (1 + 1/n)^p = 1 + p/n + O(1/n^2)$. Thus the series converges for $p > 1$ and diverges for $p \le 1$.
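For the first example, the expansion behind Gauss's test can be verified with exact rational arithmetic, using $a_n/a_{n+1} = ((2n+1)/(2n))^2$. This is a small sketch; `ratio` and `alpha_estimate` are illustrative names.

```python
from fractions import Fraction


def ratio(n):
    """a_n / a_{n+1} = ((2n + 1) / (2n))**2 for the series of the example."""
    return Fraction(2 * n + 1, 2 * n) ** 2


def alpha_estimate(n):
    """n * (a_n / a_{n+1} - 1), which tends to alpha as n grows."""
    return n * (ratio(n) - 1)


for n in (1, 10, 100, 1000):
    remainder = n * n * (ratio(n) - 1 - Fraction(1, n))  # bounded part beta_n with p = 2
    print(n, float(alpha_estimate(n)), remainder)
```

The estimate equals 1 + 1/(4n), so alpha = 1, and the remainder term is the constant 1/4, which is bounded as the test requires.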


3 Sequence of functions 


We have already discussed convergence of sequences and series. However, in many applications, we actually get a sequence or series involving some real variable x. For example, one can consider power series distributions, the log series and the exponential series as functions of x. Thus in addition to n there is an additional variable x. Naturally, convergence concepts are to be reformulated taking into account the presence of x.


3.1 Definition & convergence 


Let A be a nonempty subset of $\mathbb{R}$. Suppose for each $n \in \mathbb{N}$, $f_n : A \to \mathbb{R}$ is a function. Then $\{f_n(x)\}$ is defined as a sequence of functions on A to $\mathbb{R}$, and A is called the domain of the sequence of functions. $x^n$, $x \in [-1, 1]$; $\exp(-nx)$, $x \in \mathbb{R}$; and $\sin(nx)/n$, $x \in \mathbb{R}$, are examples of sequences of functions. It is easy to observe that for each fixed $x_0 \in A$, we get a real sequence $\{f_n(x_0)\}$.

We have already introduced the convergence concepts for a real sequence $\{a_n\}$. But in this case, we have an extra variable x. Thus it is intuitively clear that we need to develop separate convergence concepts depending on the role of x. For example, consider $x^n$, $x \in [-1, 1]$. For $x = 1/2$, we get the sequence $(1/2)^n$, which converges to zero. However, if we take $x = -1$, the sequence oscillates between -1 and 1 and hence diverges. Thus depending on x, the sequence either converges or diverges.


Large sample Inference: Module 6

What we provide in this module

• Pointwise convergence
• Examples & Applications


1 Pointwise convergence 


We have already defined sequences of functions and seen why convergence needs to be reformulated for them. Now we provide the convergence concepts for a sequence of functions in detail.


1.1 Definition 


Suppose $\{f_n(x)\}$ is a sequence of functions on A to $\mathbb{R}$. Then $\{f_n(x)\}$ is said to converge pointwise on A if for each $x \in A$, the sequence converges. Thus for every fixed $x_0 \in A$, the sequence $\{f_n(x_0)\}$ converges and hence there exists some $f(x_0)$ such that $\lim f_n(x_0) = f(x_0)$. Since $x_0$ is arbitrary, $f(x)$ exists for every $x \in A$. Such f is defined as the limit function of $\{f_n(x)\}$ and we write $f_n \to f$ pointwise.


1.2 How to get f? 


Although we define an f theoretically, in practice getting such an f is not easy. We shall explain how to decide on an f using examples. Suppose $f_n(x) = x^2/n$, $x \in [0, 1]$, is a sequence of functions on A to $\mathbb{R}$. Let us plot the functions for different values of n on the same graph paper. We, in addition, plot $f_n(x) = x^n$, $x \in [0, 1)$, for different values of n.

Figure 1: Convergence for a sequence of functions

Thus we observe that for the first figure, the functions approach zero as we increase n. Then it is natural to take the limit function $f(x) = 0$ for all $x \in [0, 1]$. Intuitively, the above is justified as for fixed x, $x^2/n \to 0$. However, for the second function we get a sequence of decreasing curves. Then it is natural to expect the horizontal axis as the limit. This suggests taking the limit function $f(x) = 0$ for all $x \in [0, 1)$. Thus we find that after a large n, the graph becomes very close to the limit function. Obviously, such an n depends on the value of x. Thus $f_n \to f$ pointwise means that for every $x \in A$ and for every positive $\epsilon$, there exists a positive integer $\nu(x, \epsilon)$ such that $|f_n(x) - f(x)| < \epsilon$ is satisfied for every $n \ge \nu(x, \epsilon)$. However, in practice f is defined by looking at the nature of $\lim f_n(x)$ for fixed x.


1.3 Examples 


Example 1: Consider $f_n(x) = x^2/n$, $x \in \mathbb{R}$. Then for fixed x, we find that $\lim f_n(x) = 0$ and hence we claim $f(x) = 0$ for all $x \in \mathbb{R}$. However, the claim is justified only if we can find a feasible $\nu(x, \epsilon)$, as defined above. Now $|f_n(x) - f(x)| = x^2/n < \epsilon$ holds whenever $n \ge [x^2/\epsilon] + 1$, where $[\cdot]$ denotes the integer part. Thus we get $\nu(x, \epsilon) = [x^2/\epsilon] + 1$. Hence $\lim f_n(x) = f(x)$ pointwise for $f(x) = 0$, $x \in \mathbb{R}$.

Example 2: Consider $f_n(x) = x^n$, $x \in (0, 1)$. Then for fixed x, we find that x is a positive fraction and hence $\lim f_n(x) = 0$ and we claim $f(x) = 0$ for all $x \in (0, 1)$. For $\epsilon \ge 1$, $|f_n(x) - f(x)| < \epsilon$ is satisfied for any $n \ge 1$. However, for $\epsilon \in (0, 1)$, $|f_n(x) - f(x)| < \epsilon$ gives $n \ge [\log \epsilon / \log x] + 1$. Thus, in any case we can take $\nu(x, \epsilon) = [\log \epsilon / \log x] + 1$. Hence $\lim f_n(x) = f(x)$ pointwise.
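The index $\nu(x, \epsilon)$ of Example 2 can be computed directly, which makes its dependence on x visible. A minimal sketch, assuming $0 < x < 1$; the function name `nu` is illustrative.

```python
import math


def nu(x, eps):
    """nu(x, eps) = floor(log(eps) / log(x)) + 1 for 0 < x < 1.

    For eps >= 1 every n >= 1 works, so 1 is returned.
    """
    if eps >= 1:
        return 1
    return math.floor(math.log(eps) / math.log(x)) + 1


eps = 0.01
for x in (0.3, 0.5, 0.9, 0.99):
    n = nu(x, eps)
    print(x, n, x ** n < eps)
```

The required n grows without bound as x approaches 1, which already hints that no single n can serve all x in (0, 1).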


Example 3: Consider $f_n(x) = x^n/n^2$, $x \in (-1, 1)$. Then for fixed x, we find that $x^n \to 0$ so that $\lim f_n(x) = 0$ and we claim $f(x) = 0$ for all $x \in (-1, 1)$. However, finding a $\nu(x, \epsilon)$ directly from $|f_n(x) - f(x)| < \epsilon$ is quite complicated. Now it is easy to observe that $|x^n/n^2 - 0| < |1/n^2 - 0|$ for every $x \in (-1, 1)$. Thus, $|1/n^2 - 0| < \epsilon \Rightarrow |x^n/n^2 - 0| < \epsilon$. Now for $\epsilon > 1$, $|1/n^2 - 0| < \epsilon$ is satisfied for every $n \ge 1$. However, for $\epsilon \in (0, 1]$, we get $n \ge [1/\sqrt{\epsilon}] + 1$. Thus, in any case we can take $\nu(x, \epsilon) = [1/\sqrt{\epsilon}] + 1$. Hence $\lim f_n(x) = f(x)$ pointwise.

Example 4: Consider $f_n(x) = nx/(x + n)$, $x > 0$. Then for fixed x, we find that $n/(x + n) \to 1$ so that $\lim f_n(x) = x$ and we claim $f(x) = x$ for all $x > 0$. Now $|f_n(x) - f(x)| = \frac{x^2}{x + n} < \frac{x^2}{n}$. Thus $|f_n(x) - f(x)| < \epsilon$ is satisfied if $\frac{x^2}{n} < \epsilon$ is satisfied. Now $\frac{x^2}{n} < \epsilon$ gives $n \ge [x^2/\epsilon] + 1$. Thus, we define $\nu(x, \epsilon) = [x^2/\epsilon] + 1$, which is a feasible choice, and hence $\lim f_n(x) = x$, $x > 0$, pointwise.

Example 5: Consider $f_n(x) = nx/(x^2 + n^2)$, $x > 0$. Then for fixed x, we find that $n/(x^2 + n^2) \to 0$ so that $\lim f_n(x) = 0$ and we claim $f(x) = 0$ for all $x > 0$. Now $|f_n(x) - f(x)| = nx/(x^2 + n^2) < \frac{x}{n}$. Thus $|f_n(x) - f(x)| < \epsilon$ is satisfied if $\frac{x}{n} < \epsilon$ is satisfied. Now $\frac{x}{n} < \epsilon$ gives $n \ge [x/\epsilon] + 1$. Thus, if we define $\nu(x, \epsilon) = [x/\epsilon] + 1$, we get a feasible choice and hence $\lim f_n(x) = 0$, $x > 0$, pointwise.


Example 6: Consider $f_n(x) = nx \exp(-nx)$, $x > 0$. Here, deciding an appropriate limit function is not an easy task. Thus, we plot the function for different values of n on the same graph paper to get an idea. The plot is given below.

Figure 2: Deciding limit function (plot of nx exp(-nx) for different values of n)

It is easy to observe that the curve attains a peak and then diminishes to zero. As we increase n, the point at which the peak is attained moves nearer to zero. This suggests taking the limit function as $f(x) = 0$, $x > 0$. However, finding a $\nu(x, \epsilon)$ directly from the relation $|f_n(x) - 0| < \epsilon$ is not an easy task. Again, $\exp(nx) = 1 + nx + (nx)^2/2 + \cdots$ gives $\exp(nx) > (nx)^2/2$, so that $f_n(x) = nx/\exp(nx) < 2/(nx)$. Thus $|f_n(x) - 0| < \epsilon$ will be satisfied if we can find a $\nu(x, \epsilon)$ such that $2/(nx) < \epsilon$ for $n \ge \nu(x, \epsilon)$. Now the latter inequality gives $\nu(x, \epsilon) = [2/(x\epsilon)] + 1$. Since we get a feasible choice, $\lim f_n(x) = 0$, $x > 0$, pointwise.


Example 7: Consider $f_n(x) = I(|x| \le n)$, where I is an indicator function. As earlier, the first task is to decide a limit function f. Intuitively, it is clear that as we increase n, more of the real line will be covered and we get the value unity there. Thus we claim $f(x) = 1$ for all x. Now $|f_n(x) - f(x)| = I(|x| > n)$. Clearly for $\epsilon > 1$, $|f_n(x) - f(x)| < \epsilon$ is satisfied for every $n \ge 1$. For $\epsilon \in (0, 1]$, the above inequality is satisfied whenever $|x| \le n$ or, equivalently, $n \ge \nu(x, \epsilon) = [|x|] + 1$. Hence $\lim f_n(x) = 1$ for all x pointwise.


Large sample Inference: Module 7

What we provide in this module

• Uniform convergence
• Examples
• A mathematical definition of uniform convergence with examples


1 Uniform convergence 


We have already discussed pointwise convergence of a sequence of functions. It is seen that for pointwise convergence, it is necessary to get a feasible $\nu(x, \epsilon)$. So a natural question is whether we can find a $\nu' = \nu'(\epsilon)$, depending only on $\epsilon$, such that the convergence holds. This leads to the concept of uniform convergence.

We explain the concept through an example. Consider $f_n(x) = x^n/n^2$, $x \in (-1, 1)$. Then it is already shown that $f_n(x) \to 0$ pointwise on (-1, 1). One can recall that, for the above example, we get the choice $\nu(x, \epsilon) = [1/\sqrt{\epsilon}] + 1$. Naturally such a choice is independent of x. Thus we get a common value of $\nu(x, \epsilon)$ ensuring pointwise convergence at every x. Hence the convergence is more than pointwise and is termed uniform (in the sense that a uniform choice of $\nu(x, \epsilon)$ exists).


1.1 Uniform convergence-Definition 


Suppose $\{f_n(x)\}$ is a sequence of functions on A to $\mathbb{R}$. Then $\{f_n(x)\}$ is said to converge uniformly to f on A if for any given $\epsilon > 0$, there exists an integer $\nu' = \nu'(\epsilon)$ such that for all $x \in A$ and $n \ge \nu'$,

$$|f_n(x) - f(x)| < \epsilon$$

is satisfied. One can use the symbol $f_n \rightrightarrows f$ to indicate uniform convergence on A. It is easy to observe that uniform convergence implies pointwise convergence, and uniform convergence on A implies that on B where $B \subset A$.

Now we shall discuss the concept graphically. Suppose $\{f_n(x)\}$ is a sequence of functions on A to $\mathbb{R}$ such that $f_n \rightrightarrows f$ on A. Then uniform convergence can be thought of as the existence of some $\nu'$, independent of x, such that the graph of $f_n$ lies inside the band $(f - \epsilon, f + \epsilon)$ for every $n \ge \nu'$.

Figure 1: Uniform Convergence


1.2 Uniform Convergence: How to check in practice? 


Suppose using some intuitive method, a limit function f is decided and pointwise convergence is established. Also suppose we get some $\nu(x, \epsilon)$ for $\epsilon > 0$ and $x \in A$. Consider $a, b \in A$; then for a given $\epsilon > 0$ we get integers $\nu(a, \epsilon)$ and $\nu(b, \epsilon)$ such that $|f_n(a) - f(a)| < \epsilon$ for all $n \ge \nu(a, \epsilon)$ and $|f_n(b) - f(b)| < \epsilon$ for all $n \ge \nu(b, \epsilon)$ are satisfied. Naturally, both the inequalities are satisfied if $n \ge \max(\nu(a, \epsilon), \nu(b, \epsilon))$. The above suggests that if $\nu'(\epsilon) = \sup_{x \in B} \nu(x, \epsilon)$ is finite for some $B \subset A$, then $f_n \rightrightarrows f$ on B.


1.3 Examples 


Example 1: Consider $f_n(x) = x^2/n$, $x \in \mathbb{R}$. Then for fixed x, we find that $\lim f_n(x) = 0$ and hence we claim $f(x) = 0$ for all $x \in \mathbb{R}$. In our previous module, we have determined $\nu(x, \epsilon) = [x^2/\epsilon] + 1$. Naturally $\sup_{x \in \mathbb{R}} \nu(x, \epsilon)$ is not finite but, for example, $\sup_{x \in (0, a)} \nu(x, \epsilon) \le [a^2/\epsilon] + 1$ is finite for any finite $a > 0$. Thus $f_n \rightrightarrows 0$ on (0, a) whereas $f_n \to 0$ only pointwise for $x \in \mathbb{R}$.

Example 2: Consider $f_n(x) = x^n$, $x \in (0, 1)$. Then using the methods previously discussed we find that $\lim f_n(x) = f(x)$ pointwise on (0, 1) for $f(x) = 0$, $x \in (0, 1)$. The choice $\nu(x, \epsilon) = [\log \epsilon / \log x] + 1$ was derived. Now, for $\epsilon < 1$, $\lim_{x \to 1^-} \nu(x, \epsilon) = +\infty$ and hence the convergence is not uniform on (0, 1). However, $\nu(x, \epsilon)$ is increasing in x and hence if $0 < b < 1$, then $\sup_{x \in (0, b)} \nu(x, \epsilon) \le \nu(b, \epsilon)$ is finite. Therefore, the convergence is uniform over (0, b), $b < 1$.

Example 3: Consider $f_n(x) = x^n/n^2$, $x \in (-1, 1)$. Then we have seen that pointwise convergence to $f(x) = 0$, $x \in (-1, 1)$, is satisfied whenever $n \ge [1/\sqrt{\epsilon}] + 1$. Thus, in this case $\nu(x, \epsilon) = [1/\sqrt{\epsilon}] + 1$, which is independent of x. Hence the convergence is uniform over (-1, 1). Thus we observe from the previous examples that pointwise convergence on A does not necessarily imply uniform convergence on A. But we can get a $B \subset A$ on which the convergence is uniform.

Example 4: Consider $f_n(x) = nx/(x + n)$, $x > 0$. Then it is already shown that $f_n(x) \to x$ for all $x > 0$ pointwise. Since $|f_n(x) - x| = x^2/(x + n) < x^2/n$, a feasible choice is $\nu(x, \epsilon) = [x^2/\epsilon] + 1$. Since $x^2/\epsilon \to \infty$ as $x \to \infty$ (indeed $\sup_{x > 0} x^2/(x + n) = \infty$ for every n), the convergence cannot be uniform over $(0, \infty)$. However, $x^2$ is an increasing function and hence $\sup_{x \in B} \nu(x, \epsilon)$ is finite for $B = (0, b]$ with $b < \infty$. Hence the convergence is uniform over (0, b] for any finite positive b.

Example 5: Consider $f_n(x) = nx/(x^2 + n^2)$, $x > 0$. Then it is shown that $f_n \to f$ for $f(x) = 0$, $x > 0$, pointwise on $(0, \infty)$. Since $|f_n(x)| < x/n$, a feasible choice is $\nu(x, \epsilon) = [x/\epsilon] + 1$. Then just as in the previous example, it can be shown that the convergence is uniform over (0, b] for any finite positive b, but not over $(0, \infty)$, since $f_n(n) = 1/2$ for every n. Similarly, uniform convergence for the other examples can also be investigated.


1.4 Uniform convergence-A mathematical definition 


We have discussed so far uniform convergence from first principles. However, such a definition is not always easy to check and hence we need a simpler condition. Suppose $\{f_n(x)\}$ is a sequence of functions on A to $\mathbb{R}$ which converges pointwise to some $f(x)$ on A. Then $f_n \rightrightarrows f$ on $B \subset A$ iff

$$M_n = \sup_{x \in B} |f_n(x) - f(x)| \to 0$$

as $n \to \infty$.

Consequently, we have the following results on sum and product functions.

The first result is on the uniform convergence of a sequence of functions where the nth term is a sum of two. Formally, if $f_n \rightrightarrows f$ and $g_n \rightrightarrows g$ on A, then

(i) $c f_n(x) \rightrightarrows c f(x)$ on A for any constant c, and

(ii) $f_n(x) + g_n(x) \rightrightarrows f(x) + g(x)$ on A.

The next result is on the uniform convergence of a sequence of functions where the nth term is a product of two. Formally, if $f_n \rightrightarrows f$ and $g_n \rightrightarrows g$ on A and there exists $M > 0$ such that $|f_n(x)| \le M$ and $|g_n(x)| \le M$ for all n and all $x \in A$, then

$$f_n(x) g_n(x) \rightrightarrows f(x) g(x) \text{ on } A.$$


1.5 Examples 


Example 1: Consider $f_n(x) = nx \exp(-nx)$, $x \ge 0$. We have already seen that $f_n(x) \to f(x)$ pointwise on $x \ge 0$, where $f(x) = 0$, $x \ge 0$. Now $M_n = \sup_{x \ge 0} nx \exp(-nx)$ and a simple algebra shows that $nx \exp(-nx)$ has a unique maximum at $x = 1/n$. Thus $M_n = \exp(-1)$, which is non-zero and independent of n. Thus the convergence is not uniform over $[0, \infty)$. However, if we consider $B = [1, \infty)$, then $M_n = n \exp(-n) \to 0$ and hence the convergence is uniform on B.

Example 2: Consider $f_n(x) = x^n$, $x \in [0, 1]$. Define $f(x) = 0$ if $0 \le x < 1$ and $f(x) = 1$ for $x = 1$. Naturally the convergence holds trivially for x = 0, 1. Moreover, we have already seen that $f_n(x) \to f(x)$ pointwise on $x \in [0, 1]$. Now $M_n = \sup_{x \in [0, 1)} x^n$, since $f_n(1) - f(1) = 0$. Since $x^n$ is an increasing function of x, this supremum is approached as $x \to 1$. Thus $M_n = 1$, which is different from zero, and hence the convergence is not uniform over [0, 1]. However, if we consider $B = [0, a]$, $a < 1$, then $M_n = a^n \to 0$ and hence the convergence is uniform on B.

Example 3: Consider $f_n(x) = I(|x| \le n)$, where I is an indicator function. Then it is already shown that $\lim f_n(x) = 1$ for all x pointwise. Now $|f_n(x) - f(x)| = I(|x| > n)$. Now for every $n \ge 1$, $I(|x| > n)$ is either zero or 1, and it equals 1 for some x. Thus $M_n = 1 \not\to 0$ and hence the convergence is not uniform on $\mathbb{R}$.

Example 4: Consider $f_n(x) = x^2 \exp(-nx)$, $x \in [0, 1]$. It is easy to observe that $f_n(x) \to f(x)$ pointwise on [0, 1], where $f(x) = 0$, $x \in [0, 1]$. Now $M_n = \sup_{x \in [0, 1]} x^2 \exp(-nx)$. A simple algebra shows that, for $n \ge 2$, $x^2 \exp(-nx)$ has a unique maximum at $x = 2/n$. Thus $M_n = \frac{4}{n^2} \exp(-2) \to 0$ and hence the convergence is uniform over [0, 1].
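The criterion $M_n \to 0$ can be explored numerically by replacing the supremum with a maximum over a fine grid. The sketch below does this for $nx \exp(-nx)$ and $x^2 \exp(-nx)$; the grid size, the truncation of $[0, \infty)$ to [0, 5], and the helper names are assumptions made for illustration.

```python
import math


def sup_on_grid(g, lo, hi, points=20001):
    """Approximate sup of g over [lo, hi] by the maximum on an equally spaced grid."""
    step = (hi - lo) / (points - 1)
    return max(g(lo + i * step) for i in range(points))


def M_example1(n, hi=5.0):
    """Grid approximation of sup over [0, hi] of n x exp(-n x)."""
    return sup_on_grid(lambda x: n * x * math.exp(-n * x), 0.0, hi)


def M_example4(n):
    """Grid approximation of sup over [0, 1] of x^2 exp(-n x)."""
    return sup_on_grid(lambda x: x * x * math.exp(-n * x), 0.0, 1.0)


for n in (2, 10, 100):
    print(n, round(M_example1(n), 6), M_example4(n), 4 * math.exp(-2) / n ** 2)
```

The first column of suprema stays near exp(-1) for every n, while the second shrinks like 4 exp(-2)/n^2, in line with the non-uniform and uniform conclusions respectively.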


Large sample Inference: Module 8

What we provide in this module

• Consequences of uniform convergence for sequence of functions
• Series of functions & their convergence
• Examples
• Consequences of uniform convergence for series of functions


1 Consequences of uniform convergence 


Now we shall discuss a few consequences of uniform convergence. In fact, uniform convergence allows the interchange of different operations like limits, differentiation, integration, etc.


1.1 Uniform convergence & limit interchanging 
If $f_n \rightrightarrows f$ on A, where $f_n$ is continuous on A for every n, then f is also continuous on A. Thus for any $a \in A$, we have

$$\lim_{x \to a} \lim_{n \to \infty} f_n(x) = \lim_{n \to \infty} \lim_{x \to a} f_n(x).$$

The implication of the above is that if the limits are not interchangeable, then uniform convergence does not hold. For example, consider $f_n(x) = x^n$, $x \in [0, 1)$; then the pointwise limit is the function $f(x) = 0$, $x \in [0, 1)$, so that $\lim_{x \to 1^-} f(x) = 0$. However, $\lim_{x \to 1^-} f_n(x) = 1$ for every n. Thus $\lim_{x \to 1^-} \lim_{n \to \infty} f_n(x) = 0 \ne 1 = \lim_{n \to \infty} \lim_{x \to 1^-} f_n(x)$. Hence the convergence cannot be uniform on [0, 1).


1.2 Uniform convergence & differentiation 


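Let $\{f_n(x)\}$ be a sequence of functions on [a, b] such that
(i) for some $x_0 \in [a, b]$, $f_n(x_0) \to f(x_0)$ and
(ii) $f_n'(x)$ exists at all $x \in [a, b]$ and converges uniformly to some function g on [a, b].

Then $f_n \rightrightarrows f$ on [a, b] and $g(x) = f'(x)$ for all $x \in [a, b]$. That is

$$\lim_{n \to \infty} \frac{d}{dx} f_n(x) = \frac{d}{dx} \lim_{n \to \infty} f_n(x) \quad \forall x \in [a, b].$$

For better understanding, consider an example with $f_n(x) = \frac{x}{1 + n^2 x^2}$, $|x| \le 1$. Define $f(x) = 0$, $x \in [-1, 1]$; then it can be shown that $M_n = \sup_{x \in [-1, 1]} |f_n(x) - f(x)| = 1/(2n) \to 0$ as $n \to \infty$. Thus the convergence is uniform. However $f_n'(x) = \frac{1 - n^2 x^2}{(1 + n^2 x^2)^2}$ and hence for $x = 0$, $\lim_{n \to \infty} f_n'(0) = 1$. Thus $\lim_{n \to \infty} f_n'(x) \ne f'(x)$ for $x = 0$. Uniform convergence of $f_n$ alone therefore does not permit the interchange; here condition (ii) fails, because the limit of $f_n'$ is discontinuous at 0.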


1.3 Uniform convergence & integration 


Suppose $\{f_n(x)\}$ is a sequence of Riemann integrable functions on [a, b] such that $f_n \rightrightarrows f$ on [a, b]. Then f is also Riemann integrable on [a, b] and

$$\lim_{n \to \infty} \int_a^b f_n(x)\,dx = \int_a^b \lim_{n \to \infty} f_n(x)\,dx.$$

For an example, consider $f_n(x) = nx \exp(-nx^2)$, $x \in [0, 1]$. Define $f(x) = 0$, $x \in [0, 1]$; then $f_n \to f$ pointwise and it can be shown that $M_n = \sup_{x \in [0, 1]} |f_n(x) - f(x)| = \sqrt{n/(2e)} \to \infty$ as $n \to \infty$. Thus the convergence is not uniform. Indeed $\int_0^1 f_n(x)\,dx = (1 - \exp(-n))/2 \to 1/2$ but $\int_0^1 f(x)\,dx = 0$. However, if we consider the same functions on the interval [1, 3], then the convergence is uniform and hence $\int_1^3 f_n(x)\,dx \to 0 = \int_1^3 f(x)\,dx$.
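The failure and success of the limit-integral interchange in this example can be checked with a simple midpoint rule, taking $f_n(x) = nx \exp(-nx^2)$. This is a rough numerical sketch; the quadrature routine and its step count are illustrative choices, not part of the original notes.

```python
import math


def f(n, x):
    """f_n(x) = n x exp(-n x^2)."""
    return n * x * math.exp(-n * x * x)


def midpoint_integral(g, a, b, m=20000):
    """Midpoint-rule approximation of the integral of g over [a, b] with m cells."""
    h = (b - a) / m
    return h * sum(g(a + (i + 0.5) * h) for i in range(m))


for n in (1, 10, 50):
    on_01 = midpoint_integral(lambda x: f(n, x), 0.0, 1.0)
    on_13 = midpoint_integral(lambda x: f(n, x), 1.0, 3.0)
    print(n, round(on_01, 6), (1 - math.exp(-n)) / 2, on_13)
```

On [0, 1] the integrals approach 1/2 although the pointwise limit integrates to 0; on [1, 3] they shrink to 0 as the interchange result predicts.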


2 Series of functions-Motivation 


We have already discussed sequences of functions and their convergence. Like ordinary series, the idea of a sequence of functions can also be extended to a series of functions. For example, we have the well known exponential series and log series. As earlier, we can have different concepts of convergence.

Let $\{f_n(x)\}$ be a sequence of functions on A to $\mathbb{R}$. Then $f_1 + f_2 + f_3 + \cdots$ is said to be a series or infinite series of functions and is often denoted by $\sum_{n=1}^{\infty} f_n(x)$. Define the sequence of partial sums of the infinite series by $s_n(x) = \sum_{i=1}^{n} f_i(x)$. If the sequence $\{s_n(x)\}$ converges pointwise to some $s(x)$ on A, the series $\sum_{n=1}^{\infty} f_n(x)$ is said to converge pointwise to $s(x)$ on A. However, if the convergence is uniform to some $s(x)$ on A, the series $\sum_{n=1}^{\infty} f_n(x)$ converges uniformly to $s(x)$ on A. If the series $\sum_{n=1}^{\infty} |f_n(x)|$ converges for each $x \in A$, $\sum_{n=1}^{\infty} f_n(x)$ is said to converge absolutely on A.


2.1 Examples 


Example 1: Consider the series $x^2 + \frac{x^2}{1 + x^2} + \frac{x^2}{(1 + x^2)^2} + \cdots$, for $x \in [0, 1]$. Then

$$s_n(x) = \sum_{k=1}^{n} \frac{x^2}{(1 + x^2)^{k-1}} = (1 + x^2) - (1 + x^2)^{-n+1}$$

is the nth partial sum. Thus we find that

$$\lim s_n(x) = \begin{cases} 0 & \text{if } x = 0, \\ 1 + x^2 & \text{if } x \in (0, 1]. \end{cases}$$

Thus $s_n(x)$ converges pointwise to $s(x) = (1 + x^2) I(0 < x \le 1)$ with I as an indicator function. Hence the series converges pointwise to $s(x)$ on [0, 1]. However, $s(x)$ is not continuous on [0, 1] but each $s_n(x)$ is continuous on [0, 1], and hence the convergence is not uniform on [0, 1].

Example 2: Consider the series $\sum_{n=1}^{\infty} x^n (1 - x)^n$, for $x \in [0, 1]$. Then

$$s_n(x) = \sum_{k=1}^{n} x^k (1 - x)^k = x(1 - x)\, \frac{1 - [x(1 - x)]^n}{1 - x(1 - x)}$$

is the nth partial sum. Thus we find that

$$s(x) = \lim s_n(x) = \begin{cases} 0 & \text{if } x = 0, 1, \\ \dfrac{x(1 - x)}{1 - x(1 - x)} & \text{if } x \in (0, 1). \end{cases}$$

Then $s_n(x)$ converges pointwise to $s(x)$ and hence the series converges pointwise to $s(x)$ on [0, 1].

Example 3: Consider the series $\sum_{n=1}^{\infty} \sin(nx)/n^p$, for $x \in \mathbb{R}$. Let us fix x at a and consider the convergence of $\sum_n \sin(na)/n^p$. Now $|\sin(na)/n^p| \le 1/n^p$ for all $n \ge 1$. Hence by the comparison test $\sum_n |\sin(na)|/n^p$ converges for $p > 1$, that is, the series converges absolutely. Since a is arbitrary, the series $\sum_{n=1}^{\infty} \sin(nx)/n^p$ is absolutely convergent on $\mathbb{R}$ for $p > 1$. However, for uniform convergence, we need some simpler conditions or tests, as for ordinary series.


3 Weierstrass M test: Test for convergence

Let $\{f_n(x)\}$ be a sequence of functions on A such that for every $n \ge 1$, $|f_n(x)| \le M_n$ for any $x \in A$. Then the series $\sum_n f_n(x)$ converges absolutely and uniformly on A if the series $\sum_n M_n$ is convergent.

The above test only provides a sufficient condition and hence it may be possible that $\sum_n f_n(x)$ converges uniformly but $\sum_n M_n$ diverges even for the best possible choice of $M_n$. The best possible choice for $M_n$ is given by

$$M_n = \sup_{x \in A} |f_n(x)|.$$


3.1 Examples 


Example 1: Consider the series $\sum_{n=1}^{\infty} \cos(nx)/n^p$ for $x \in \mathbb{R}$. It is well known that $|\cos(nx)/n^p| \le 1/n^p$ for all $n \ge 1$ and $x \in \mathbb{R}$. Thus we define $M_n = 1/n^p$. Since the series $\sum M_n$ converges for $p > 1$, the given series converges absolutely and uniformly on $\mathbb{R}$ for $p > 1$.

Example 2: Consider the series $\sum_{n=1}^{\infty} x^2/(1 + n^2x^2)$ for $x \in \mathbb{R}$. Now

$$\frac{x^2}{1 + n^2x^2} = \frac{n^2x^2}{1 + n^2x^2}\, n^{-2} < n^{-2} \quad \forall\, x \in \mathbb{R}.$$

Thus we take $M_n = 1/n^2$. Since the series $\sum M_n$ converges, the given series converges absolutely and uniformly on $\mathbb{R}$.


Example 3: Consider the series $\sum_{n=1}^{\infty} x/(n + n^2x^2)$ for $x \in \mathbb{R}$. Here $f_n(x) = x/(n + n^2x^2)$, and hence $|f_n(x)| = |x|/(n + n^2x^2)$ is symmetric about 0. Thus we consider only $x > 0$. Setting the derivative to zero shows that $x/(n + n^2x^2)$, $x > 0$, attains its maximum at $x = 1/\sqrt{n}$, where its value is $1/(2n^{3/2})$. Thus we take $M_n = 1/(2n^{3/2})$. Since $\sum M_n$ converges, the given series converges absolutely and uniformly on $\mathbb{R}$.


Example 4: Consider again the series $\sum_{n=1}^{\infty} x/(n + n^2x^2)$ for $x \in \mathbb{R}$, this time without calculus. Here $f_n(x) = x/(n + n^2x^2)$, and hence $|f_n(x)| = |x|/(n + n^2x^2)$ is symmetric about 0. Now it is well known that $n + n^2x^2 \ge 2n^{3/2}|x|$ (the arithmetic mean of $n$ and $n^2x^2$ is at least their geometric mean), which implies that $|f_n(x)| \le \frac{1}{2n^{3/2}}$ for $x \in \mathbb{R}$. Thus we take $M_n = 1/(2n^{3/2})$. Hence the given series converges absolutely and uniformly on $\mathbb{R}$.
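The bound used in Examples 3 and 4 can be checked numerically. The following sketch is illustrative only: the helper names, the grid on $[-5, 5]$ and the number of grid points are assumptions made here, not part of the module. It compares a grid estimate of $\sup_x |f_n(x)|$ with $M_n = 1/(2n^{3/2})$ and sums the $M_n$ to show that their partial sums stay bounded.

```python
import math


def f(n: int, x: float) -> float:
    """n-th term of the series: x / (n + n^2 x^2)."""
    return x / (n + n * n * x * x)


def m_bound(n: int) -> float:
    """Weierstrass bound M_n = 1 / (2 n^(3/2)) from Examples 3 and 4."""
    return 1.0 / (2.0 * n**1.5)


def grid_sup(n: int, half_width: float = 5.0, points: int = 20001) -> float:
    """Largest |f_n(x)| over an equally spaced grid on [-half_width, half_width]."""
    step = 2.0 * half_width / (points - 1)
    return max(abs(f(n, -half_width + i * step)) for i in range(points))


if __name__ == "__main__":
    for n in (1, 4, 9):
        # The grid maximum should sit just below M_n, attained near x = 1/sqrt(n).
        print(n, grid_sup(n), m_bound(n), f(n, 1.0 / math.sqrt(n)))
    # Partial sums of M_n remain bounded, consistent with convergence of sum M_n.
    print(sum(m_bound(n) for n in range(1, 10001)))
```

A grid can only approximate a supremum, so the comparison supports the algebra above rather than replacing it.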


4 Consequences of uniform convergence for series 


We have already discussed the consequences of uniform convergence for sequences of functions, in particular the interchangeability of operations such as limits, differentiation and integration. We now extend those ideas to uniformly convergent series of functions.


4.1 Uniform convergence & continuity 


Consider a series of functions $\sum_{n=1}^{\infty} f_n(x)$. Suppose

(i) $f_n(x)$ is continuous on $A$ for every $n \ge 1$ and

(ii) $\sum_{n=1}^{\infty} f_n(x)$ converges uniformly to $f(x)$ on $A$.

Then $f(x)$ is continuous on $A$.


4.2 Interchangeability of summation & limit 


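Suppose $\sum_{n=1}^{\infty} f_n(x)$ converges uniformly to $f(x)$ on $A$. Let $x_0$ be any point in $A$. Then, assuming each limit $\lim_{x \to x_0} f_n(x)$ exists, $\lim_{x \to x_0} f(x)$ exists and

$$\lim_{x \to x_0} \sum_{n=1}^{\infty} f_n(x) = \sum_{n=1}^{\infty} \lim_{x \to x_0} f_n(x).$$

Thus the operations of summation and limit are interchangeable under uniform convergence.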


4.3 Interchangeability of differentiation & limit 


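Consider a series of functions $\sum_{n=1}^{\infty} f_n(x)$, $x \in [a,b]$, such that
(i) $\sum_{n=1}^{\infty} f_n(x_0)$ converges to $f(x_0)$ for some $x_0 \in [a,b]$,
(ii) $f_n'(x)$ exists at all $x \in (a,b)$ and
(iii) $\sum_{n=1}^{\infty} f_n'(x)$ converges uniformly on [a,b].

Then $\frac{d}{dx} \sum_{n=1}^{\infty} f_n(x) = \sum_{n=1}^{\infty} \frac{d}{dx} f_n(x)$.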


4.4  Interchangeability of integration & limit 


Consider a series of functions $\sum_{n=1}^{\infty} f_n(x)$, $x \in [a,b]$, such that
(i) each $f_n(x)$ is Riemann integrable on [a,b] and
(ii) $\sum_{n=1}^{\infty} f_n(x)$ converges uniformly on [a,b].

Then $\sum_{n=1}^{\infty} \int_a^b f_n(x)\,dx = \int_a^b \sum_{n=1}^{\infty} f_n(x)\,dx$.
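As a numerical illustration of term-by-term integration, take $f_n(x) = \cos(nx)/n^2$, which converges uniformly by the M test with $M_n = 1/n^2$. The sketch below is not from the module: the function names are hypothetical, and it relies on the standard Fourier identity $\sum_{n \ge 1} \cos(nx)/n^2 = \pi^2/6 - \pi x/2 + x^2/4$ on $[0, 2\pi]$, stated here as an assumption. It compares the sum of the individual integrals over $[0, \pi/2]$ with the integral of the limit function.

```python
import math


def term_integral(n: int, b: float) -> float:
    """Integral of cos(n x)/n^2 over [0, b], which equals sin(n b)/n^3."""
    return math.sin(n * b) / n**3


def summed_integrals(b: float, terms: int = 2000) -> float:
    """Sum of the first `terms` term-wise integrals."""
    return sum(term_integral(n, b) for n in range(1, terms + 1))


def limit_function(x: float) -> float:
    """Assumed closed form of sum cos(n x)/n^2 for x in [0, 2 pi]."""
    return math.pi**2 / 6 - math.pi * x / 2 + x * x / 4


def integral_of_limit(b: float, steps: int = 1000) -> float:
    """Composite Simpson rule for the limit function on [0, b] (steps must be even)."""
    h = b / steps
    total = limit_function(0.0) + limit_function(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * limit_function(i * h)
    return total * h / 3


if __name__ == "__main__":
    b = math.pi / 2
    # Both numbers should agree closely, as the interchange result predicts.
    print(summed_integrals(b), integral_of_limit(b))
```

Both quantities come out close to $\pi^3/32$, the value obtained by integrating the quadratic directly.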


Large sample Inference: Module 9


What we provide in this module


• Power series
• Convergence & applications
• Taylor series & applications


Co-Ordinator: Dr. Rahul Bhattacharya, Department of Statistics, Calcutta University


1 Power series 


A power series is a particular type of series of functions. The probability mass functions of most discrete distributions, such as the binomial, Poisson and negative binomial, are the terms of a convergent power series. In addition, we use various generating functions, which are also power series. Thus the existence of a generating function can be justified from the convergence properties of power series. We often deduce moments from the moment generating function by successive differentiation; convergence of the power series validates such operations.


1.1 Power series: Definition & Convergence


The series of functions $\sum_{n=0}^{\infty} f_n(x)$, $x \in A$, is said to be a power series around $c$ if $f_n(x) = a_n (x - c)^n$, $n \ge 0$. For simplicity we take $c = 0$. Naturally a power series does not always converge for every $x$. However, it always converges to $a_0$ for $x = c$. For example, $\sum_{n=0}^{\infty} x^n/n!$ converges for every real $x$, whereas the series $\sum_{n=0}^{\infty} n!\,x^n$ converges only for $x = 0$.


Now we shall discuss convergence of a power series. Consider the power series $\sum_{n=0}^{\infty} a_n (x - c)^n$, $x \in A$. Assume that one of the following limits exists (it may be infinite):
(i) $\lim |a_{n+1}/a_n| = 1/R$
(ii) $\lim |a_n|^{1/n} = 1/R$
Then the power series converges absolutely for all $x$ such that $|x - c| < R$ and diverges for all $x$ such that $|x - c| > R$. In addition, the power series converges uniformly and absolutely on every compact subset of $(c - R, c + R)$.
$R$ as defined above is called the radius of convergence of the power series. The interval $(c - R, c + R)$, which is $(-R, R)$ when $c = 0$, is called the interval of convergence of the power series. If $R = 0$, the power series converges only at $x = c$. If $R = \infty$, the power series is said to be convergent everywhere. Every power series can be differentiated term by term, and also integrated term by term, within its interval of convergence.
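Criterion (i) suggests a rough numerical check of a radius of convergence: evaluate $|a_n/a_{n+1}|$ at a large index. The sketch below is an illustration under stated assumptions, not a general method: the function names are hypothetical, coefficients are supplied through $\log |a_n|$ to avoid overflow, and a single finite index can only hint at the limit.

```python
import math


def ratio_estimate(log_abs_coeff, n: int) -> float:
    """Estimate R by |a_n / a_(n+1)| at index n, given n -> log|a_n|."""
    return math.exp(log_abs_coeff(n) - log_abs_coeff(n + 1))


def log_coeff_log_series(n: int) -> float:
    """a_n = (-1)^n / n, so log|a_n| = -log n."""
    return -math.log(n)


def log_coeff_exponential(n: int) -> float:
    """a_n = 1 / n!, so log|a_n| = -log(n!)."""
    return -math.lgamma(n + 1)


def log_coeff_factorial_over_power(n: int) -> float:
    """a_n = n! / n^n, so log|a_n| = log(n!) - n log n."""
    return math.lgamma(n + 1) - n * math.log(n)


if __name__ == "__main__":
    n = 10_000
    print(ratio_estimate(log_coeff_log_series, n))            # near 1
    print(ratio_estimate(log_coeff_exponential, n))           # grows with n (R infinite)
    print(ratio_estimate(log_coeff_factorial_over_power, n))  # near e
```

The estimate says nothing about the end points of the interval, which always need a separate argument.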


1.2 Examples 


Example 1: Consider the power series $x + x^2 + x^3 + x^4 + \cdots$. Naturally this is the so-called geometric series. However, one can alternatively interpret it as a power series with $a_n = 1$ for all $n \ge 1$. Now $\lim |a_n|^{1/n} = 1$, and hence the power series converges absolutely for all $|x| < 1$. Thus the radius of convergence is $R = 1$ and the interval of convergence is (-1,1). It is already noted that the series diverges for $x = \pm 1$, and hence (-1,1) is the exact interval of convergence.


Example 2: Consider the power series $\sum_{n=1}^{\infty} (-x)^n/n$. This is the so-called log series, as well as a power series with $a_n = (-1)^n/n$ for all $n \ge 1$. Now $\lim |a_n|^{1/n} = 1$, as $\lim n^{1/n} = 1$, and hence the power series converges absolutely for all $|x| < 1$. Thus the radius of convergence is $R = 1$ and the interval of convergence is (-1,1). However, for $x = -1$ we get the so-called harmonic series, which diverges. Again, for $x = +1$ we get the alternating series, which converges. Thus (-1,1] is the exact interval of convergence.


Example 3: Consider the power series $\sum_{n=0}^{\infty} x^n/n!$. This is the well-known e series and is also a power series with $a_n = 1/n!$ for all $n \ge 0$. Now $\lim |a_{n+1}/a_n| = 0$, and hence the power series converges absolutely for all $x \in (-\infty, \infty)$. Thus the radius of convergence in this case is $R = \infty$ and the interval of convergence is the whole real line.


Example 4: Consider the power series $\sum_{n=0}^{\infty} n!\,x^n$. This is a power series with $a_n = n!$ for all $n \ge 0$. Now $\lim |a_{n+1}/a_n| = \infty$, and hence the power series does not converge for any $x \in (-\infty, \infty)$ except for $x = 0$. The radius of convergence in this case is $R = 0$.


Example 5: Consider the power series $\sum_{n=2}^{\infty} (x + 2)^n/\log(n)$. This is a power series centered at $c = -2$ with $a_n = 1/\log(n)$ for all $n \ge 2$. Now $\lim |a_{n+1}/a_n| = 1$, and hence the power series converges for any $|x + 2| < 1$, that is, for $-3 < x < -1$. The radius of convergence in this case is $R = 1$, with interval of convergence (-3,-1). However, for $x = -3$ we get the series $\sum (-1)^n/\log(n)$. Now $1/\log(n)$, $n \ge 2$, is a decreasing sequence with limit zero. Hence by the Leibniz test the series converges. For $x = -1$ we get the series $\sum 1/\log(n)$, which diverges by the condensation test. Hence the exact interval of convergence is $[-3, -1)$.


Example 6: Consider the power series $\sum_{n=1}^{\infty} n!\,(x - 1)^n/n^n$. This is a power series centered at $c = 1$ with $a_n = n!/n^n$ for all $n \ge 1$. Now $\lim |a_{n+1}/a_n| = \lim \left(\frac{n}{n+1}\right)^n = 1/e$, so that $R = e$, and hence the power series converges for any $x$ satisfying $|x - 1| < e$, that is, for $1 - e < x < 1 + e$. The radius of convergence in this case is $R = e$, with interval of convergence $(1 - e, 1 + e)$. The behaviour at the two boundary points must be examined separately to determine the exact interval of convergence.


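Example 7: Consider the hypergeometric series

$$1 + \frac{\alpha\beta}{1\cdot\gamma}\,x + \frac{\alpha(\alpha+1)\beta(\beta+1)}{1\cdot 2\cdot\gamma(\gamma+1)}\,x^2 + \cdots$$

where $\alpha$, $\beta$, $\gamma$ and $x$ are all positive. This is a power series with

$$a_n = \frac{\alpha(\alpha+1)\cdots(\alpha+n-1)\,\beta(\beta+1)\cdots(\beta+n-1)}{n!\,\gamma(\gamma+1)\cdots(\gamma+n-1)}, \quad n \ge 1.$$

Then we get $\lim |a_{n+1}/a_n| = 1$, and hence the power series converges for any $x$ satisfying $|x| < 1$. The radius of convergence in this case is $R = 1$, with interval of convergence (-1,1). However, convergence on the boundary $x = \pm 1$ depends on $\alpha$, $\beta$ and $\gamma$.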


2 Taylor series expansion 


Taylor series expansion is one of the most useful tools for a statistician. A Taylor series 
provides a power series representation of a function using its higher order derivatives. It is 
extremely useful for obtaining the large sample distribution of different implicit / explicit 


functions of relevant statistics. We start with the formal definition of a Taylor Series. 


2.1 Taylor's Theorem: Univariate case 


Let $f(x)$, $x \in [a,b]$, be a function defined on the real line such that
(i) $f^{(k)}(x)$ is continuous on [a,b] for all $k \le n$ and
(ii) $f^{(n+1)}(x)$ exists on [a,b].

Then for any $x, x_0 \in [a,b]$ we have

$$f(x) = P_n(x) + R_n(x),$$

where $P_n(x) = \sum_{k=0}^{n} f^{(k)}(x_0)(x - x_0)^k/k!$ with $f^{(0)}(x) = f(x)$, and $R_n(x) = f^{(n+1)}(c)(x - x_0)^{n+1}/(n+1)!$, where $c = tx + (1 - t)x_0$ for some $t \in [0,1]$.


Note that $P_n$ is called the $n$th Taylor polynomial for $f$ at $x_0$, and it is a reasonable approximation to $f$ for points near $x_0$. $R_n$ is called the remainder term in Lagrange's form. If $\lim R_n(x) = 0$ for all $x \in (a,b)$, then $\lim P_n(x) = f(x)$ for all $x$. Thus $f$ admits the Taylor series expansion $f(x) = \sum_{k=0}^{\infty} f^{(k)}(x_0)(x - x_0)^k/k!$. If $x_0 = 0$, the above is called the Maclaurin expansion of $f(x)$.


However, convergence is difficult to check for a Taylor series, and hence we need some sufficient conditions. Consider the $n$th term, which is $f^{(n)}(x_0)(x - x_0)^n/n!$. Now $\sum_{n=0}^{\infty} (x - x_0)^n/n!$ converges for all $x$. Thus if we can show that $|f^{(n)}(x)|$ is bounded by some finite number $M$ for all $n$ and all $x$ in some interval around $x_0$, the convergence can be asserted for that interval.
2.1.1 Examples 


Example 1: Consider $f(x) = \sin(x)$ and assume $x_0 = 0$. Then $f^{(n)}(x) = \sin(n\pi/2 + x)$. Thus $R_n(x) = \sin((n+1)\pi/2 + c)\,x^{n+1}/(n+1)!$. Now it is easy to observe that $|\sin(n\pi/2 + x)| \le 1$ for any $n \ge 1$, and hence by the condition stated above the Taylor series converges to the sine function on the real line. The same is valid for the cosine function.
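The Lagrange bound for the sine function can be checked directly. The sketch below is illustrative and its function names are hypothetical. It builds $P_n$ from $f^{(k)}(0) = \sin(k\pi/2)$ and compares the actual error with the bound $|x|^{n+1}/(n+1)!$, which follows from $|f^{(n+1)}(c)| \le 1$.

```python
import math


def taylor_sin(x: float, n: int) -> float:
    """n-th Taylor polynomial of sin about 0, using f^(k)(0) = sin(k*pi/2)."""
    total = 0.0
    for k in range(n + 1):
        derivative_at_zero = math.sin(k * math.pi / 2)
        total += derivative_at_zero * x**k / math.factorial(k)
    return total


def lagrange_bound(x: float, n: int) -> float:
    """Upper bound on |R_n(x)| obtained from |f^(n+1)(c)| <= 1."""
    return abs(x) ** (n + 1) / math.factorial(n + 1)


if __name__ == "__main__":
    x = 2.0
    for n in (3, 7, 11, 15):
        error = abs(taylor_sin(x, n) - math.sin(x))
        # The observed error should never exceed the Lagrange bound.
        print(n, error, lagrange_bound(x, n))
```

For a fixed $x$ the bound tends to zero as $n$ grows, which is the convergence argument used in Example 1.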


Example 2: Consider $f(x) = \log(1 + x)$ and assume $x_0 = 0$. Since $f^{(n)}(x) = (-1)^{n-1}(n - 1)!\,(1 + x)^{-n}$, the Taylor series around 0 is $\sum_{n=1}^{\infty} (-1)^{n-1}x^n/n$. Suppose we wish to know whether the above series converges for $x = 1$. Now $R_n(x) = (-1)^n\, n!\,(1 + c)^{-(n+1)}x^{n+1}/(n+1)!$ with $c \in (0,1)$. Since $|R_n(1)| \le 1/(n+1) \to 0$, the series converges. Thus we get the representation $\log(2) = \sum_{n=1}^{\infty} (-1)^{n-1}/n$.
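The representation of $\log(2)$ can be checked numerically. The sketch below is illustrative and its function name is hypothetical. It computes partial sums of the alternating series and compares the error with $1/(n+1)$, the bound that follows from the Lagrange remainder at $x = 1$.

```python
import math


def log2_partial_sum(n: int) -> float:
    """Sum of (-1)^(k-1) / k for k = 1, ..., n."""
    return sum((-1) ** (k - 1) / k for k in range(1, n + 1))


if __name__ == "__main__":
    for n in (10, 100, 1000):
        error = abs(log2_partial_sum(n) - math.log(2))
        # The error should stay below the remainder bound 1/(n+1).
        print(n, error, 1 / (n + 1))
```

Convergence is slow: the bound only guarantees an error of order $1/n$.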


2.2 Taylor's Theorem: Multivariate case 


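Let $f = f(x_1, x_2, \ldots, x_s)$ be a function of $s$ variables defined on $A \subseteq \mathbb{R}^s$. Define the gradient operator as $\nabla = \left(\frac{\partial}{\partial x_1}, \ldots, \frac{\partial}{\partial x_s}\right)'$. Also define the partial differential operator $x'\nabla = \sum_{i=1}^{s} x_i \frac{\partial}{\partial x_i}$. The operator $(x'\nabla)^k$ is expanded as the $k$th power of a sum.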


Multivariate Taylor's Theorem: First order: Let $f = f(x_1, x_2, \ldots, x_s)$ be a function of $s$ variables defined on $A \subseteq \mathbb{R}^s$. Suppose there exists a neighbourhood of $a$ in $A$ such that the partial derivatives up to order $(l + 1)$ are continuous in the neighbourhood. Then for any $x$ belonging to that neighbourhood,

$$f(x) - f(a) = \sum_{k=1}^{l} \frac{((x - a)'\nabla)^k}{k!} f(a) + \frac{((x - a)'\nabla)^{l+1}}{(l+1)!} f(z),$$

where $z = ta + (1 - t)x$ for some $t \in [0, 1]$.


Multivariate Taylor's Theorem: Second order: The second order expansion is often expressed in terms of the Hessian matrix $H(x) = \left(\left(\frac{\partial^2 f(x)}{\partial x_i\,\partial x_j}\right)\right)$ as

$$f(x) = f(a) + \nabla f(a)'(x - a) + \frac{1}{2}(x - a)' H(z)(x - a),$$

where $z = ta + (1 - t)x$ for some $t \in [0, 1]$.


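Multivariate Taylor's Theorem using $O$ and $o$: If all the partial derivatives of $f$ up to order $(l + 1)$ are bounded in the neighbourhood of $a$, we have either

$$f(x) - f(a) = \sum_{k=1}^{l} \frac{((x - a)'\nabla)^k}{k!} f(a) + o(|x - a|^l)$$

or

$$f(x) - f(a) = \sum_{k=1}^{l} \frac{((x - a)'\nabla)^k}{k!} f(a) + O(|x - a|^{l+1}),$$

where $|x - a| = \sqrt{\sum_{i=1}^{s} (x_i - a_i)^2}$.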
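The $O(|x - a|^{l+1})$ form with $l = 2$ can be illustrated numerically. The sketch below is not from the module: the test function $f(x_1, x_2) = e^{x_1}\cos(x_2)$, the expansion point $a = (0, 0)$ and all names are assumptions chosen for illustration. The gradient and Hessian at $a$ are entered by hand, and the Hessian is evaluated at $a$ rather than at the intermediate point $z$, so the result is an approximation with an error of order $|x - a|^3$.

```python
import math


def f(x1: float, x2: float) -> float:
    """Illustrative test function exp(x1) * cos(x2)."""
    return math.exp(x1) * math.cos(x2)


A = (0.0, 0.0)                             # expansion point a
GRAD_AT_A = (1.0, 0.0)                     # gradient of f at a, computed by hand
HESSIAN_AT_A = ((1.0, 0.0), (0.0, -1.0))   # Hessian of f at a, computed by hand


def second_order_approx(point) -> float:
    """f(a) + grad(a)'(x - a) + (1/2)(x - a)' H(a) (x - a)."""
    h = [p - a for p, a in zip(point, A)]
    linear = sum(g * hi for g, hi in zip(GRAD_AT_A, h))
    quadratic = sum(
        h[i] * HESSIAN_AT_A[i][j] * h[j] for i in range(2) for j in range(2)
    )
    return f(*A) + linear + 0.5 * quadratic


def approximation_error(point) -> float:
    """Absolute error of the second order approximation at the given point."""
    return abs(f(*point) - second_order_approx(point))


if __name__ == "__main__":
    coarse = approximation_error((0.1, 0.1))
    fine = approximation_error((0.05, 0.05))
    # Halving |x - a| should cut the error by roughly 2^3 = 8.
    print(coarse, fine, coarse / fine)
```

The ratio of the two errors is a quick way to see the cubic order of the remainder.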


Large sample Inference: Module 10


What we provide in this module


• Convergence in probability
• Convergence in distributions
• Applications


Co-Ordinator: Dr. Rahul Bhattacharya, Department of Statistics, Calcutta University


1 Convergence Of A Sequence Of Random Variables 


Convergence of a sequence of random variables is the central concept in asymptotic theory. Two types of approximation are of great importance in statistics. The first type approximates a given distribution function by another. In the second type, a given random variable is approximated by another random variable.

Corresponding to the first type, we have convergence in distribution. Connected with the second type, we identify three distinct types of convergence, namely
• convergence in probability,
• almost sure convergence,
• convergence in the $r$th mean.


2 Convergence in distributions 


2.1 Basic ideas 


The pivotal quantity in convergence in distribution is the distribution function, so we first recall some basics. For a random variable $X$, its cdf is defined as $F(x) = P(X \le x)$, $x \in \mathbb{R}$. It is well known that
• $F$ is non-decreasing,
• $F$ is right continuous, i.e. $F(x + 0) = F(x)$ for all $x$,
• $F(\infty) = 1$ and $F(-\infty) = 0$.

The last condition ensures that $F$ is proper.
Let $\{X_n\}$ be a sequence of random variables and $\{F_n\}$ the corresponding sequence of distribution functions, i.e. $X_n \sim F_n(x)$. We want to investigate the convergence of $F_n$ to some limiting cdf. Consider an example where $P(X_n = n) = 1$ for all $n$. Then $F_n(x) = I(x \ge n)$, which is a sequence of functions with pointwise limit $F_n(x) \to 0$ for all $x$. Hence the pointwise limit is not a proper cdf.
However, for a statistician only convergence to a cdf is of interest, and hence we should define convergence such that $F_n(x) \to F(x)$ for a cdf $F(x)$, which may not be proper.


2.2 Weak convergence & convergence in distributions 


Consider a sequence of distribution functions $F_n(x)$. Then $F_n \xrightarrow{w} F$ (i.e. $F_n$ converges weakly to $F$) if $F_n(x) \to F(x)$ for all $x$ at which $F(x)$ is continuous.

For weak convergence, we do not need a proper cdf $F$. If, in addition, $F$ is proper, we have convergence in distribution.

Consider a sequence of random variables $X_n$ and distribution functions $F_n(x)$ with $X_n \sim F_n$. Then $X_n \xrightarrow{D} X$ (i.e. $X_n$ converges in distribution, or in law) if there exists a random variable $X$ with cdf $F(x)$ such that $F_n(x) \to F(x)$ for all $x$ at which $F(x)$ is continuous.


2.3 Examples 


Example 1: Suppose $P(X_n = a + 1/n) = 1$ for all $n \ge 1$. Then one would expect convergence to a random variable $X$ such that $P(X = a) = 1$. However, $F_n(x) = I(x \ge a + 1/n)$. Thus $\lim F_n(x) = I(x > a)$. Hence, if we define $F(x) = I(x \ge a)$, we find $\lim F_n(a) = 0$ but $F(a) = 1$. Naturally, if we do not restrict attention to continuity points of $F$, convergence is not possible. However, $a$ is not a continuity point of $F$, and hence according to the definition $X_n \xrightarrow{D} X$.
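Example 1 can be traced with a few evaluations of the two distribution functions. The sketch below is illustrative only: the function names and the choice $a = 0$ are assumptions. It shows that $F_n$ and $F$ disagree at the discontinuity point $a$ for every $n$, yet agree at any other point once $n$ is large enough.

```python
def cdf_n(x: float, n: int, a: float = 0.0) -> float:
    """cdf of X_n, degenerate at a + 1/n: I(x >= a + 1/n)."""
    return 1.0 if x >= a + 1.0 / n else 0.0


def cdf_limit(x: float, a: float = 0.0) -> float:
    """cdf of X, degenerate at a: I(x >= a)."""
    return 1.0 if x >= a else 0.0


if __name__ == "__main__":
    for n in (1, 10, 1000):
        # At x = a the two cdfs never agree; at x = a + 0.01 they agree once 1/n <= 0.01.
        print(n, cdf_n(0.0, n), cdf_limit(0.0), cdf_n(0.01, n), cdf_limit(0.01))
```

Only the continuity points of $F$ matter in the definition, so the disagreement at $a$ does not prevent convergence in distribution.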


Example 2: Suppose $X_n \sim N(0, 1 + 1/n^2)$ for all $n \ge 1$. Then we could expect that $X_n$ converges to a standard normal variable. For better understanding, we plot the distribution of $X_n$ for different values of $n$.


Figure 1: Convergence in distribution


Naturally, we find that for increasing values of $n$ the graph approaches the density of a standard normal variable. Moreover, $F_n(x) = \Phi\left(x/\sqrt{1 + 1/n^2}\right)$. Since $1/\sqrt{1 + 1/n^2} \to 1$ as $n \to \infty$, we have $F_n(x) \to \Phi(x)$ pointwise on $\mathbb{R}$. Hence $X_n$ converges in distribution to a standard normal variable.


Example 3: Suppose $X_n \sim N(0, n^2) \ \forall n \ge 1$. Then we have $F_n(x) = \Phi(x/n) \ \forall x$, so
that $\lim F_n(x) = 1/2$. Thus, if we define $F(x) = 1/2$, we find $F(\infty) = 1/2 \ne 1$. Hence $F$ is not
proper, so that $F_n \xrightarrow{w} F$ only. Naturally, we cannot identify any rv $X$ having df $F$ and
hence we do not have convergence in distribution.


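Example 4: Suppose $X_n \sim N(0, 1/n^2) \ \forall n \ge 1$. Then we have $F_n(x) = \Phi(nx) \ \forall x$, so
that $\lim F_n(x) = I(x > 0) + \frac{1}{2} I(x = 0)$. Thus, if we define $F(x) = I(x > 0) + \frac{1}{2} I(x = 0)$,
we find that $F$ is not right continuous at 0 and so is not a cdf. But the limit agrees with the
degenerate cdf $I(x \ge 0)$ at every $x \ne 0$, and 0 is a point of discontinuity of that cdf. Thus
$X_n \xrightarrow{D} X$, where $P(X = 0) = 1$. The above can be justified from the graph on the next page.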
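
The behaviour of $F_n$ in Examples 2 to 4 can also be checked numerically. The following is a minimal sketch in Python, assuming the second parameter of $N(\cdot,\cdot)$ is the variance; the function names `std_normal_cdf` and `cdf_example` are illustrative and not part of the notes.

```python
import math

def std_normal_cdf(x):
    """Standard normal cdf Phi(x), written via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def cdf_example(example, n, x):
    """cdf F_n(x) of X_n ~ N(0, v_n) for Examples 2, 3 and 4 (v_n is the variance)."""
    variance = {2: 1.0 + 1.0 / n**2, 3: float(n**2), 4: 1.0 / n**2}[example]
    return std_normal_cdf(x / math.sqrt(variance))

# Tabulate F_n at x = -0.5, 0, 0.5 for each example as n grows.
for n in (1, 10, 100, 1000):
    row = [round(cdf_example(ex, n, x), 4) for ex in (2, 3, 4) for x in (-0.5, 0.0, 0.5)]
    print(n, row)
```

For large $n$ the Example 2 columns settle at $\Phi(x)$, the Example 3 columns all drift to 1/2, and the Example 4 columns approach 0, 1/2 and 1, matching the limits derived above.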


Figure 2: Convergence in distribution


2.4 Convergence in distributions: Uniform Convergence 


The definition of convergence in distribution is based on the concept of pointwise convergence
of a sequence of distribution functions. However, the bounded and non-decreasing nature
of a distribution function enables uniform convergence. The following theorem ensures uniform
convergence.

Polya Theorem: Consider a sequence of distribution functions $F_n$ and another cdf
$F(x)$. If $F_n(x) \to F(x)$ for all $x$ and $F(x)$ is continuous, then
$$\sup_x |F_n(x) - F(x)| \to 0$$
as $n \to \infty$.
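
As an illustration of the uniform convergence asserted by the theorem, the sketch below approximates $\sup_x |F_n(x) - \Phi(x)|$ on a finite grid for $X_n \sim N(0, 1 + 1/n^2)$ (Example 2). The grid width and the function names are assumptions made for this illustration only; a grid maximum is only an approximation to the supremum.

```python
import math

def std_normal_cdf(x):
    """Standard normal cdf Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sup_distance(n, half_width=8.0, steps=4001):
    """Grid approximation to sup_x |F_n(x) - Phi(x)| for X_n ~ N(0, 1 + 1/n^2)."""
    sd = math.sqrt(1.0 + 1.0 / n**2)
    grid = [-half_width + 2.0 * half_width * i / (steps - 1) for i in range(steps)]
    return max(abs(std_normal_cdf(x / sd) - std_normal_cdf(x)) for x in grid)

distances = {n: sup_distance(n) for n in (1, 2, 5, 10, 50)}
for n, d in distances.items():
    print(n, d)
```

The printed distances shrink steadily as $n$ grows, which is what uniform convergence requires.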


3 Convergence in probability 


For an ordinary sequence $a_n, n \ge 1$ converging to $a$, it is already noted that for given $\epsilon > 0$,
$|a_n - a| < \epsilon$ is satisfied for large $n$. However, for a sequence of random variables, such an
inequality may not hold with probability one. But we can make the corresponding probability
arbitrarily close to unity for an appropriately chosen limit. In particular, the corresponding
probability sequence is an ordinary sequence and the above suggests that it converges to
unity provided the limit random variable is chosen appropriately.
Convergence in probability: Consider a sequence of random variables $X_n$. Then $X_n$
is said to converge in probability to another random variable $X$ if for every given $\epsilon > 0$,
$$\lim P(d(X_n, X) \le \epsilon) = 1 \iff \lim P(d(X_n, X) > \epsilon) = 0,$$
where $d(X_n, X) = |X_n - X|$ is the distance function and the random variables $X_n, n \ge 1$ and
$X$ are defined on a common sample space. For random vectors, one can use the Euclidean
norm
$$d(X_n, X) = \|X_n - X\| = \sqrt{(X_n - X)^T (X_n - X)}.$$
We use the notation $X_n - X \xrightarrow{P} 0$ for stochastic $X$ and $X_n \xrightarrow{P} X$ for nonrandom $X$.


3.1 An example 


Consider $X_n = X + 1/n^2$, where $X \sim N(0,1)$. Intuitively, it appears that for large $n$, the
contribution of the second quantity is negligible and hence $X_n$ is very close to $X$. Now
$X_n - X$ is degenerate at $1/n^2$. Then for any $\epsilon \ge 1$, $P(|X_n - X| > \epsilon) = 0$. In general,
$P(|X_n - X| > \epsilon) = 1$ or $0$ according as $\epsilon < 1/n^2$ or $\epsilon \ge 1/n^2$. Thus for any $\epsilon > 0$,
$P(|X_n - X| > \epsilon) \to 0$ and hence $X_n \xrightarrow{P} X$.


3.2 Convergence in probability & distribution 


Convergence in probability implies convergence in distribution but the converse is true only
for degenerate limits. As an example consider a random variable $X$ having a $N(0,1)$
distribution and define $X_n = (-1)^n X$. Then due to symmetry $X_n \stackrel{D}{=} N(0,1)$ and hence
$X_n \xrightarrow{D} Y \sim N(0,1)$. Take $Y = X$ as the natural candidate limit. For every odd $n$ we have
$|X_n - Y| = 2|X|$, whose distribution does not depend on $n$, and hence
$P(|X_n - Y| > \epsilon)$ does not tend to zero. Thus convergence in probability does not hold
though convergence in distribution holds.
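
A small Monte Carlo check of this counterexample is sketched below. It assumes the candidate limit is $Y = X$ itself; the function name, the number of draws and the seed are illustrative choices.

```python
import random

def exceed_probability(n, eps, draws=20000, seed=0):
    """Monte Carlo estimate of P(|X_n - X| > eps) with X_n = (-1)^n X, X ~ N(0,1)."""
    rng = random.Random(seed)
    sign = -1.0 if n % 2 else 1.0
    count = 0
    for _ in range(draws):
        x = rng.gauss(0.0, 1.0)
        if abs(sign * x - x) > eps:
            count += 1
    return count / draws

for n in (1, 2, 101, 102):
    print(n, exceed_probability(n, eps=0.5))
```

For even $n$ the estimate is exactly 0, while for odd $n$ it stays near $P(2|X| > 0.5)$ however large $n$ is, so the probability sequence has no limit equal to zero.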


Large sample Inference: Module 11: 


What we provide in this module 


• Almost sure convergence
• Weak law of large numbers
• Strong law of large numbers
• Borel-Cantelli Lemma


1 Co-Ordinator: Dr. Rahul Bhattacharya, Department of Statistics, Calcutta University 


1 Convergence almost surely 


Convergence in probability does not ensure $\lim X_n = X$ with probability 1 and hence we
have another mode of convergence, known as almost sure convergence.

Almost sure convergence: Consider a sequence of random variables $X_n$. Then $X_n$ is
said to converge almost surely to another random variable $X$, i.e. $X_n - X \xrightarrow{a.s.} 0$, if
$$\sup_{m \ge n} d(X_m, X) \xrightarrow{P} 0,$$
where the random variables $X_n, n \ge 1$ and $X$ are defined on a common sample space. This
is a stronger mode of convergence because $\sup_{m \ge n} d(X_m, X) \ge d(X_n, X) \ge 0$.


1.1 An example 

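Consider $X_n \stackrel{iid}{\sim} R(0,1), n \ge 1$. Then it is well known that $E(X_{(1:n)}) = \frac{1}{n+1} \to 0$ for large $n$.
Now consider the quantity $P(|X_{(1:m)} - 0| < \epsilon \ \forall m \ge n)$. Since $X_{(1:m)} \in [0,1]$ with probability
one, we get $|X_{(1:m)} - 0| \in [0,1]$ with probability one. Then for $\epsilon > 1$, the above probability
is simply one. Now assume $0 < \epsilon \le 1$; then the above probability reduces to
$P(X_{(1:m)} < \epsilon \ \forall m \ge n)$. Since $X_{(1:n)} \ge X_{(1:m)}$ for $m \ge n$, we have
$P(X_{(1:m)} < \epsilon \ \forall m \ge n) = P(X_{(1:n)} < \epsilon)$. Now
$P(X_{(1:n)} < \epsilon) = 1 - P(X_{(1:n)} \ge \epsilon) = 1 - (1 - \epsilon)^n \to 1$ and hence $X_{(1:n)} \xrightarrow{a.s.} 0$.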
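
The example can be illustrated with a short sketch: one function evaluates the probability $1 - (1-\epsilon)^n$ obtained above, and the other simulates a single path of the running minimum $X_{(1:n)}$. The function names and the seed are assumptions made for the illustration.

```python
import random

def prob_all_later_minima_below(eps, n):
    """P(X_(1:m) < eps for all m >= n) = 1 - (1 - eps)^n for iid R(0,1) draws."""
    return 1.0 - (1.0 - eps) ** n

def running_minimum(n_max, seed=1):
    """One simulated path of the running minimum X_(1:n), n = 1..n_max."""
    rng = random.Random(seed)
    path, current = [], 1.0
    for _ in range(n_max):
        current = min(current, rng.random())
        path.append(current)
    return path

path = running_minimum(10000)
for n in (10, 100, 1000, 10000):
    print(n, path[n - 1], prob_all_later_minima_below(0.01, n))
```

Because the running minimum can only decrease, once it falls below $\epsilon$ it stays there, which is why the whole tail event has the same probability as the single event at index $n$.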


2 Convergence in the r th mean 


Consider the random variables $X_n$ and $X$ defined on a common sample space such that
$E|X_n|^r < \infty$ for some $r > 0$. Then $X_n$ is said to converge in the r th mean to $X$ if
$$E|X_n - X|^r \to 0$$
provided $E|X|^r < \infty$. In notation, we use $X_n \xrightarrow{r} X$. It should be noted that if $X_n \xrightarrow{r} X$ for
some $r > 0$ then $X_n \xrightarrow{P} X$. However, the converse may not be always true.


3 Stochastic order relations 


We have already introduced big O and small O notations for ordinary sequences. Similar 


notations can also be introduced to simplify the notions regarding stochastic convergence. 


Stochastic big O: For a sequence of random variables $X_n$, if for every $\epsilon > 0$ there
exists a positive constant $K(\epsilon)$ and an integer $n_0 = n_0(\epsilon)$ such that
$$P(|X_n| \le K(\epsilon)) \ge 1 - \epsilon, \quad n \ge n_0(\epsilon),$$
then $X_n = O_p(1)$, i.e. $X_n$ is bounded in probability.
For sequences of random variables $X_n$ and $Y_n$, if for every $\epsilon > 0$ there exists a positive
constant $K(\epsilon)$ and an integer $n_0 = n_0(\epsilon)$ such that
$$P\left(\left|\frac{X_n}{Y_n}\right| \le K(\epsilon)\right) \ge 1 - \epsilon, \quad n \ge n_0(\epsilon),$$
then $X_n = O_p(Y_n)$.


Stochastic small o: For a sequence of random variables $X_n$, if for every $\epsilon > 0$ and
$\eta > 0$ there exists a positive integer $n_0 = n_0(\epsilon, \eta)$ such that
$$P(|X_n| > \eta) < \epsilon, \quad n \ge n_0(\epsilon, \eta),$$
then $X_n = o_p(1)$.

It is easy to observe that $X_n = o_p(1) \iff X_n \xrightarrow{P} 0$. One can use the Euclidean norm to extend
the idea to vector valued random variables.

For sequences of random variables $X_n$ and $Y_n$, if for every $\epsilon > 0$ and $\eta > 0$ there exists a
positive integer $n_0 = n_0(\epsilon, \eta)$ such that
$$P\left(\left|\frac{X_n}{Y_n}\right| > \eta\right) < \epsilon, \quad n \ge n_0(\epsilon, \eta),$$
then $X_n = o_p(Y_n)$.


For a sequence of random variables $X_n$, if $X_n = o_p(1)$ then $X_n = O_p(1)$ also but the
converse is not true, in general. As an example consider $X \sim N(0,1)$ and define
$X_n = (-1)^n X$. Then $X_n \stackrel{D}{=} X$ and hence $X_n = O_p(1)$. But
$P(|X_n| > \epsilon) = 2(1 - \Phi(\epsilon))$, which is free of $n$ and does not tend to 0, and hence
we do not have the relation $X_n = o_p(1)$.

We often use representations like $X_n = Y_n + R_n$ where either $R_n \to 0$ in probability or
$nR_n \to 0$ in probability. We can use small o notation to write these as $X_n = Y_n + o_p(1)$ and
$X_n = Y_n + o_p(1/n)$, respectively.


4 Convergence of a series of random variables


We have already discussed sequences of random variables and their convergence. However,
like an ordinary series, we can also define a series with random variables. The different terms
of such a series are either independent or dependent. Often our interest lies in the asymptotic
nature of the average or some standardised version of it. Consequently, we have laws of large
numbers and central limit theorems.


4.1 Weak law of large numbers & law of averages 


Let $X_n, n \ge 1$ be a sequence of random variables and let $S_n = \sum_{k=1}^{n} X_k$. Consider a
sequence of constants $b_n, n \ge 1$, where $b_n$ is positive, nondecreasing and diverging to $+\infty$.
Then $X_n, n \ge 1$ satisfies the weak law of large numbers (WLLN) with respect to $b_n$ if there exists
a sequence of real constants $a_n$ such that
$$\frac{S_n - a_n}{b_n} \xrightarrow{P} 0$$
as $n \to \infty$. The $a_n$ are called centering constants and the $b_n$ are called norming constants. Most
often we take $a_n = E(S_n)$ and $b_n = \sqrt{Var(S_n)}$.


Suppose a random experiment is performed $n$ times and $f_n$ is the sample frequency of an event $A$.
Naturally, nothing can be predicted about the rate of occurrence $P(A)$ based on a single
trial. However, considering the averages $f_n/n, n \ge 1$, we arrive at an experiment in which
the outcome can be predicted with high accuracy. The Law of Large Numbers is therefore often
described as the law of averages.


4.1.1 Different WLLN 


Depending on different models, different variations of WLLN exist. We provide below a few
of them.

Markov WLLN: Consider a sequence of random variables $X_n$ with $E(X_k) = \mu_k$.
If $Var(S_n)/n^2 \to 0$ then
$$\frac{S_n}{n} - \frac{1}{n}\sum_{k=1}^{n} \mu_k \xrightarrow{P} 0.$$
That is, $X_n, n \ge 1$ satisfies WLLN. If, in addition, the $X_n$ are independent, then we have
Chebyshev's WLLN.


Poisson's WLLN: Consider a sequence of independent random variables $X_n$ with
$X_n \sim Bin(1, \mu_n)$. Then
$$\frac{S_n}{n} - \frac{1}{n}\sum_{k=1}^{n} \mu_k \xrightarrow{P} 0.$$
If $\mu_k = p$, then the above WLLN reduces to that of Bernoulli.


Khintchine WLLN: Consider a sequence of iid random variables $X_n$ with
$E(X_1) = \mu < \infty$. Then
$$\frac{S_n}{n} \xrightarrow{P} \mu.$$


4.1.2 An Example 


Consider tossing a coin $n$ times. If $f_n$ is the number of heads turned up then $f_n/n$ gives the
proportion of heads in $n$ throws. Naturally, nothing is known about the probability of head
at the outset. The Law of Large Numbers predicts that $f_n/n$ will be close to the unknown
probability for large $n$. Thus the value of $f_n/n$ for large $n$ would provide a close idea about
the unknown probability. For better understanding, we simulate the tosses of a coin with
success probability 0.7. For varying $n$ (50, 500 and 5000, clockwise), we have constructed the
histogram of the proportion of heads in $n$ trials. It is easy to observe that the distribution
of the observed success proportion becomes more and more concentrated about the true
proportion 0.7 as we increase $n$.
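
A simulation of this kind can be sketched as follows. The number of replications (300) and the seed are assumptions made for the illustration; the notes do not state how many replications were used for the histograms.

```python
import random
import statistics

def simulated_proportions(n, p=0.7, replications=300, seed=42):
    """Proportion of heads in n tosses, repeated `replications` times."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(replications)]

centre, spread = {}, {}
for n in (50, 500, 5000):
    proportions = simulated_proportions(n)
    centre[n] = statistics.mean(proportions)      # average observed proportion
    spread[n] = statistics.pstdev(proportions)    # spread of the observed proportions
    print(n, round(centre[n], 4), round(spread[n], 4))
```

The spread of the simulated proportions shrinks roughly like $\sqrt{0.7 \times 0.3 / n}$, which is the concentration about 0.7 that the histograms display.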


Figure 1: Convergence of observed success proportion


4.2 Strong law of large numbers(SLLN) 


Let $X_n, n \ge 1$ be a sequence of random variables and let $S_n = \sum_{k=1}^{n} X_k$. Consider a sequence
of constants $b_n, n \ge 1$, where $b_n$ is positive and diverging to $+\infty$. Then $X_n, n \ge 1$ satisfies the
strong law of large numbers (SLLN) with respect to $b_n$ if there exists a sequence of real
constants $a_n$ such that
$$\frac{S_n - a_n}{b_n} \xrightarrow{a.s.} 0$$
as $n \to \infty$. Unlike WLLN, in case of SLLN, we consider $a_n = E(S_n)$ but $b_n = n$.


4.2.1 SLLN & Borel-Cantelli Lemma 


Almost sure convergence is not easy to check in a straightforward way and hence we need
some sufficient conditions. The following result provides a way to establish almost sure
convergence.

Borel-Cantelli Lemma: If for a sequence of events $A_n, n \ge 1$, $\sum_{n=1}^{\infty} P(A_n) < \infty$ then
$P(\limsup A_n) = 0$. However, if $A_n, n \ge 1$ are pairwise independent with $\sum_{n=1}^{\infty} P(A_n) = \infty$ then
$P(\limsup A_n) = 1$.

Observe that
$$X_n - X \xrightarrow{a.s.} 0 \iff P(\limsup \{|X_n - X| > \epsilon\}) = 0$$
for every $\epsilon > 0$ and hence we can use the above lemma to establish almost sure convergence.


Large sample Inference: Module 14: 


What we provide in this module 


• Univariate Delta Theorem with general scaling factor
• Multivariate Delta Theorems
• Applications




1 Asymptotic variance & Delta Theorem 


From the delta theorem, we find that if $\sqrt{n}(T_n - \theta) \xrightarrow{D} N(0, \sigma^2(\theta))$ then under certain conditions
$\sqrt{n}(g(T_n) - g(\theta)) \xrightarrow{D} N(0, [g'(\theta)]^2 \sigma^2(\theta))$. The quantity $[g'(\theta)]^2 \sigma^2(\theta)$ is called the asymptotic
variance (AV) of $g(T_n)$. However, the relation $AV(g(T_n)) = \lim_{n \to \infty} n Var[g(T_n)]$ may not
always hold. For example, suppose $X_n$ are iid $N(1,1)$ variables and $T_n = \bar{X}_n$. Then from
CLT, $\sqrt{n}(T_n - 1) \xrightarrow{D} N(0,1)$. Define $g(T_n) = 1/T_n$; since $g'(1) = -1$, the delta theorem gives
$AV(1/T_n) = 1$.
However, exact calculation shows $Var(1/T_n)$ does not exist and hence the proposition.


2 Scaling factor & Delta Theorems 


For univariate delta theorems, we have assumed so far the scaling factor $\sqrt{n}$. But in practice,
the scaling factor may be other than $\sqrt{n}$. Let us explain using an example. Suppose $X_n, n \ge 1$ are
iid $R(0, \theta)$ variables and let $X_{(n)}$ denote the largest order statistic. Define $Y_n = n(\theta - X_{(n)})$,
then $F_{Y_n}(x) = 1 - \left(1 - \frac{x}{n\theta}\right)^n \to 1 - \exp(-x/\theta) \ \forall x > 0$. Hence
$n(\theta - X_{(n)}) \xrightarrow{D}$ Exponential(mean = $\theta$). Thus we get a scaling factor $n$ instead of the usual $\sqrt{n}$.
Naturally, the already discussed univariate delta theorems are of no use in finding the asymptotic
distribution of some function $g(X_{(n)})$.
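
The convergence of $F_{Y_n}$ to the exponential cdf can be verified numerically. The sketch below compares the exact cdf $1 - (1 - x/(n\theta))^n$ with $1 - \exp(-x/\theta)$ on a grid; the value $\theta = 2$, the grid and the function names are assumptions for illustration.

```python
import math

def cdf_scaled_gap(x, n, theta):
    """Exact cdf of Y_n = n(theta - X_(n)) for iid R(0, theta) observations."""
    if x <= 0.0:
        return 0.0
    if x >= n * theta:
        return 1.0
    return 1.0 - (1.0 - x / (n * theta)) ** n

def cdf_exponential(x, theta):
    """cdf of an exponential variable with mean theta."""
    return 1.0 - math.exp(-x / theta) if x > 0.0 else 0.0

def max_gap(n, theta):
    """Largest discrepancy between the two cdfs over the grid 0.1, 0.2, ..., 20."""
    grid = [0.1 * i for i in range(1, 201)]
    return max(abs(cdf_scaled_gap(x, n, theta) - cdf_exponential(x, theta)) for x in grid)

for n in (5, 50, 500, 5000):
    print(n, max_gap(n, theta=2.0))
```

The discrepancy falls roughly in proportion to $1/n$, consistent with the scaling factor $n$ rather than $\sqrt{n}$.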


2.1 Delta theorem of first order with general scaling factor 


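Result: Consider a sequence of random variables $X_n, n \ge 1$ such that $a_n(X_n - \theta) \xrightarrow{D} X$ for
some $a_n > 0$ such that $a_n \to \infty$. Let $g(\theta)$ be a real valued function such that $g'(\theta) = \frac{d}{d\theta} g(\theta)$
is continuous and non zero in a neighbourhood of $\theta$. Then
$$a_n(g(X_n) - g(\theta)) \xrightarrow{D} g'(\theta) X.$$

Proof: As earlier, $a_n(X_n - \theta) \xrightarrow{D} X$ implies $X_n - \theta \xrightarrow{P} 0$. Then using Taylor expansion, we
have the representation $a_n(g(X_n) - g(\theta)) = a_n(X_n - \theta)(g'(\theta) + o_p(1))$. Since $a_n(X_n - \theta) \xrightarrow{D} X$,
we have $a_n(X_n - \theta) = O_p(1)$ and so
$a_n(g(X_n) - g(\theta)) = a_n(X_n - \theta) g'(\theta) + O_p(1) o_p(1) = a_n(X_n - \theta) g'(\theta) + o_p(1)$. Now
$o_p(1) \xrightarrow{P} 0$ and hence applying Slutsky's theorem, the result follows.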


2.2 Delta theorem of higher order with general scaling factor 
Result: Consider a sequence of random variables $X_n, n \ge 1$ such that
$$a_n(X_n - \theta) \xrightarrow{D} X$$
as $n \to \infty$. Let $g(\theta)$ be a real valued function such that $g^{(r)}(\theta) = \frac{d^r}{d\theta^r} g(\theta) = 0 \ \forall r =
1, 2, \ldots, m - 1$ but $g^{(m)}(\theta) \ne 0$. Then
$$m! \, a_n^m (g(X_n) - g(\theta)) \xrightarrow{D} g^{(m)}(\theta) X^m.$$

Proof: The proof of the above is an easy consequence of Taylor's theorem of higher
orders with random variables and hence omitted.


2.3 An application 


Suppose $X_n, n \ge 1$ are iid as $R(0, \theta), \theta > 0$ and we are interested in $g(X_{(n)}) = \log X_{(n)}$. It
is already known that $n(\theta - X_{(n)}) \xrightarrow{D}$ Exponential(mean = $\theta$). Define $g(x) = \log(x)$, then
$g'(x) = 1/x \ne 0$. Then applying the above we get that
$$n(\log(\theta) - \log(X_{(n)})) \xrightarrow{D} \frac{X}{\theta},$$
where $X \sim$ Exponential(mean = $\theta$). Since $X/\theta$ has a standard exponential distribution, we
have that the asymptotic distribution of $n(\log(\theta) - \log(X_{(n)}))$ is standard exponential.


3 Delta Theorem: Multivariate case 


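The Theorem: Consider a sequence of $p$ component random vectors $X_n, n \ge 1$ and a fixed
$p$ component vector $\theta$ such that
$$a_n(X_n - \theta) \xrightarrow{D} X$$
as $n \to \infty$ for some $p$ component random vector $X$. Let $g(\theta)$ be a real valued function such
that the gradient vector $\nabla g = \left(\frac{\partial g}{\partial \theta_1}, \frac{\partial g}{\partial \theta_2}, \ldots, \frac{\partial g}{\partial \theta_p}\right)^T$ is continuous and non-zero in some
neighbourhood (nbd) of $\theta$. Then
$$a_n(g(X_n) - g(\theta)) \xrightarrow{D} (\nabla g)^T X.$$

Proof: Suppose that $|X_n - \theta|$ denotes the Euclidean distance. Assume that $\nabla g$ is
continuous in the neighbourhood (nbd) $|x - \theta| < \epsilon$. Then for $|x - \theta| < \epsilon$, using the Mean Value
Theorem (MVT), we have
$$g(x) - g(\theta) = (x - \theta)^T \int_0^1 \nabla g(\theta + t(x - \theta)) \, dt.$$

As a consequence of the fact that $a_n(X_n - \theta) \xrightarrow{D} X$, we have for every $\delta > 0$,
$P(|X_n - \theta| < \delta) = P(a_n |X_n - \theta| < a_n \delta) \to 1$ and hence $X_n - \theta \xrightarrow{P} 0$.

Now for $|X_n - \theta| < \epsilon$, we have the representation
$a_n(g(X_n) - g(\theta)) = a_n(X_n - \theta)^T \int_0^1 \nabla g(\theta + t(X_n - \theta)) \, dt$, where due to continuity and
the fact that $X_n \xrightarrow{P} \theta$, we have $\int_0^1 \nabla g(\theta + t(X_n - \theta)) \, dt \xrightarrow{P} \nabla g(\theta)$. Thus we have
for a vector valued random variable
$$a_n(g(X_n) - g(\theta)) \xrightarrow{D} (\nabla g)^T X.$$

In most of the cases, $X \sim N_p(0, \Sigma)$. Thus from the properties of linear functions of
multivariate normal variables, we have the large sample distribution of $a_n(g(X_n) - g(\theta))$ as
$$N(0, (\nabla g)^T \Sigma (\nabla g)).$$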


3.1 Examples 


Example 1: Suppose $\sqrt{n}(T_{1n} - \theta_1, T_{2n} - \theta_2)^T \xrightarrow{D} N_2(0, \Sigma)$. We are interested in the
asymptotic distribution of $g(T_{1n}, T_{2n}) = T_{1n} T_{2n}$. Here $g(x, y) = xy$ is a function of the real
variables $x$ and $y$, is continuous and has continuous partial derivatives in the nbd of $(\theta_1, \theta_2)$.
Then by the delta theorem for multiple random variables, the asymptotic distribution of
$T_{1n} T_{2n}$ is normal with mean $\theta_1 \theta_2$. Now $\nabla g(\theta_1, \theta_2) = (\theta_2, \theta_1)^T$. Hence the asymptotic variance
comes out as $(\theta_2^2 \sigma_{11} + 2\theta_1 \theta_2 \sigma_{12} + \theta_1^2 \sigma_{22})/n$, where $\Sigma = ((\sigma_{ij}))$.


Example 2: Suppose $\sqrt{n}(T_{1n} - \theta_1, T_{2n} - \theta_2)^T \xrightarrow{D} N_2(0, \Sigma)$ and we are interested in the
asymptotic distribution of $g(T_{1n}, T_{2n}) = T_{1n}/T_{2n}$. Here $g(x, y) = x/y$ is a function of the real
variables $x$ and $y$, is continuous and has continuous partial derivatives in the nbd of $(\theta_1, \theta_2)$
except for $y = 0$. Thus if $(\theta_1, \theta_2)$ is such that $\theta_2 \ne 0$, then the asymptotic distribution of
$T_{1n}/T_{2n}$ is normal with mean $\theta_1/\theta_2$. Here $\nabla g(\theta_1, \theta_2) = (1/\theta_2, -\theta_1/\theta_2^2)^T$. Consequently
the asymptotic variance comes out as $(\theta_1/\theta_2)^2 (\sigma_{11}/\theta_1^2 - 2\sigma_{12}/(\theta_1 \theta_2) + \sigma_{22}/\theta_2^2)/n$, provided
$\theta_1 \ne 0$ and $\Sigma = ((\sigma_{ij}))$.


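Example 3: Distribution of sample variance. Suppose $X_i, i \ge 1$ are iid with mean $\mu$,
variance $\sigma^2$ and finite fourth order moments. Note that the sample variance is
$s^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)^2 - (\bar{X} - \mu)^2$.
Thus we start from the joint asymptotic distribution of $T_{1n} = \frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)$ and
$T_{2n} = \frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)^2$. By multivariate CLT, we have $\sqrt{n}(T_{1n} - \theta_1, T_{2n} - \theta_2)^T \xrightarrow{D} N_2(0, \Sigma)$,
where $\theta_1 = 0$, $\theta_2 = \sigma^2 = \mu_2$ and
$$\Sigma = \begin{pmatrix} \mu_2 & \mu_3 \\ \mu_3 & \mu_4 - \mu_2^2 \end{pmatrix}.$$
Now define $g(x, y) = y - x^2$. Then $g(x, y)$ is continuous and has continuous partial derivatives
in the nbd of $(\theta_1, \theta_2)$. Since $\nabla g(\theta_1, \theta_2) = (0, 1)^T \ne 0$, we find from the delta theorem with
multiple variables that $\sqrt{n}(s^2 - \mu_2) \xrightarrow{D} N(0, (\nabla g(\theta_1, \theta_2))^T \Sigma \nabla g(\theta_1, \theta_2))$. Now after a
little manipulation we find $(\nabla g(\theta_1, \theta_2))^T \Sigma \nabla g(\theta_1, \theta_2) = \mu_4 - \mu_2^2$.
If the underlying distribution is normal, then $\mu_4 = 3\sigma^4$ and hence in such a case the asymptotic
variance reduces to $2\sigma^4$.

As an easy consequence of the delta theorem, we find that $\sqrt{n}(s - \sigma) \xrightarrow{D} N\left(0, \frac{\mu_4 - \mu_2^2}{4\sigma^2}\right)$, which
under normality reduces to $\sqrt{n}(s - \sigma) \xrightarrow{D} N\left(0, \frac{\sigma^2}{2}\right)$.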
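
The quadratic form $(\nabla g)^T \Sigma (\nabla g)$ used in the three examples above can be evaluated mechanically. The sketch below is one way to do so; the helper names are hypothetical, and the returned values are the variances of the $\sqrt{n}$-scaled quantities (divide by $n$ for the large sample variance of the statistic itself).

```python
import math

def delta_variance(gradient, sigma):
    """Quadratic form (grad)^T Sigma (grad) for a gradient list and a square matrix."""
    p = len(gradient)
    return sum(gradient[i] * sigma[i][j] * gradient[j] for i in range(p) for j in range(p))

def product_variance(t1, t2, sigma):
    """Example 1: g(x, y) = xy with gradient (theta2, theta1)."""
    return delta_variance([t2, t1], sigma)

def ratio_variance(t1, t2, sigma):
    """Example 2: g(x, y) = x/y with gradient (1/theta2, -theta1/theta2^2)."""
    return delta_variance([1.0 / t2, -t1 / t2**2], sigma)

def sample_variance_av(mu2, mu3, mu4):
    """Example 3: g(x, y) = y - x^2 evaluated at (0, mu2), gradient (0, 1)."""
    sigma = [[mu2, mu3], [mu3, mu4 - mu2**2]]
    return delta_variance([0.0, 1.0], sigma)

sigma = [[2.0, 0.5], [0.5, 1.0]]          # an arbitrary illustrative dispersion matrix
print(product_variance(1.0, 3.0, sigma))
print(ratio_variance(1.0, 3.0, sigma))
print(sample_variance_av(mu2=2.25, mu3=0.0, mu4=3 * 2.25**2))   # normal with sigma = 1.5
```

For the normal case the last line reproduces $2\sigma^4$, and the first two agree with the closed forms given in Examples 1 and 2.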


Large sample Inference: Module 12: 


What we provide in this module 


• Complete Convergence
• Different SLLN & applications
• Continuity Theorem & applications
• Central limit theorems




1 Complete convergence & Almost sure convergence 


Now we discuss another type of convergence, namely, complete convergence. Consider the
random variables $X_n$ and $X$ defined on a common sample space. Then $X_n$ is said to converge
completely to $X$ if $\sum_{n=1}^{\infty} P(d(X_n, X) > \epsilon) < \infty$ for every $\epsilon > 0$. In notation, we use
$X_n - X \xrightarrow{c} 0$.

Complete convergence is useful in establishing almost sure convergence. First of all note that
$X_n - X \xrightarrow{c} 0$ implies $X_n - X \xrightarrow{a.s.} 0$ but the converse may not hold. Thus from the properties
of convergent series, if one can show that $P(d(X_n, X) > \epsilon) = O(n^{-(1+r)})$ for some $r > 0$, then
complete convergence and hence almost sure convergence holds.


2 Different variations of SLLN 


Depending on different models, we have different variations of SLLN and we provide below
a few of them.


2.1 Borel SLLN 


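Result: Let $X_n, n \ge 1$ be iid Bernoulli($\theta$) variables. Then $\frac{S_n}{n} \xrightarrow{a.s.} \theta$.

Proof: Note that $S_n \sim Bin(n, \theta)$ and using the properties of the binomial distribution, one
can find that $E(\frac{S_n}{n} - \theta)^2 = \theta(1 - \theta)/n$. Then by Markov's inequality
$P(|\frac{S_n}{n} - \theta| > \epsilon) \le E(\frac{S_n}{n} - \theta)^2/\epsilon^2$ and since $E(\frac{S_n}{n} - \theta)^2 \le \frac{1}{4n}$ gives a bound of
order $1/n$, whose series diverges, nothing can be concluded.

However, consider $E(\frac{S_n}{n} - \theta)^4 = n^{-3} \theta(1 - \theta)[1 + 3\theta(1 - \theta)(n - 2)]$. Since $\theta(1 - \theta) \le 1/4$,
we get $1 + 3\theta(1 - \theta)(n - 2) \le (3n - 2)/4 < 3n/4$ and hence we get $E(\frac{S_n}{n} - \theta)^4 < 3n^{-2}/16$,
that is, $E(\frac{S_n}{n} - \theta)^4 = O(n^{-2})$. Thus $P(|\frac{S_n}{n} - \theta| > \epsilon) \le E(\frac{S_n}{n} - \theta)^4/\epsilon^4 < 3n^{-2}/(16\epsilon^4)$ for any
$\epsilon > 0$. Since the series $\sum_{n=1}^{\infty} P(|\frac{S_n}{n} - \theta| > \epsilon)$ converges, the result follows.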
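
The fourth moment identity and the bound used in the proof can be checked by direct summation over the binomial pmf. This is only a numerical sanity check under illustrative values of $n$ and $\theta$; the function names are not from the notes.

```python
import math

def fourth_central_moment_of_mean(n, theta):
    """E(S_n/n - theta)^4 computed directly from the Bin(n, theta) pmf."""
    total = 0.0
    for k in range(n + 1):
        pmf = math.comb(n, k) * theta**k * (1.0 - theta) ** (n - k)
        total += pmf * (k / n - theta) ** 4
    return total

def closed_form(n, theta):
    """n^{-3} theta(1-theta) [1 + 3 theta(1-theta)(n-2)], as used in the proof."""
    v = theta * (1.0 - theta)
    return v * (1.0 + 3.0 * v * (n - 2)) / n**3

for n in (5, 50, 500):
    for theta in (0.1, 0.5):
        exact = fourth_central_moment_of_mean(n, theta)
        print(n, theta, exact, closed_form(n, theta), 3.0 / (16.0 * n**2))
```

In every row the first two numbers agree and both sit below $3n^{-2}/16$, which is the summable bound that delivers complete convergence.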


2.2 Other SLLN 


Kolmogorov SLLN: Let $X_n, n \ge 1$ be independent random variables with finite variances.
Then $\frac{S_n - E(S_n)}{n} \xrightarrow{a.s.} 0$ if $\sum_{n=1}^{\infty} \frac{Var(X_n)}{n^2} < \infty$.


SLLN for bounded random variables: If $X_n, n \ge 1$ are iid uniformly bounded
random variables then $\frac{S_n}{n} \xrightarrow{a.s.} E(X_1)$.


Khintchine SLLN: Let $X_n, n \ge 1$ be iid random variables with finite expectation. Then
$\bar{X}_n \xrightarrow{a.s.} E(X_1)$.
Now we consider an extension of the above in the context of multiple random variables. Let
$X_n, n \ge 1$ be iid $q$ component random vectors with finite expectation $\theta$. If $g$ is continuous
at $\theta$ then $g(\bar{X}_n) \xrightarrow{a.s.} g(\theta)$.
An application will be helpful to understand the idea. Let $X_n, n \ge 1$ be iid random variables
with finite variance $\sigma^2$ and expectation $\mu$. Consider $s^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2$. Now
$s^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)^2 - \left\{\frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)\right\}^2$. Define $Z_i = (Z_{1i}, Z_{2i})'$ with $Z_{1i} = (X_i - \mu)^2$ and
$Z_{2i} = (X_i - \mu)$. Then $\bar{Z} \xrightarrow{a.s.} (\sigma^2, 0)'$. Since $g(x, y) = x - y^2$ is continuous, we have the desired
result, namely $s^2 \xrightarrow{a.s.} \sigma^2$, from the above.


3 Continuity theorem 


Let $X_n, n \ge 1$ be a sequence of random variables with $F_n, n \ge 1$ as the corresponding
sequence of cdf's. If $M_n(t), n \ge 1$ is the corresponding sequence of MGFs then the following
questions are natural:

Does $M_n(t)$ converge to an MGF for large $n$?

If $X_n \xrightarrow{D} X$ for some rv $X$ with MGF $M(t)$, does $M_n(t) \to M(t)$?

The continuity theorem, given below, provides the answers to the above.

The theorem: Let $F_n, n \ge 1$ be a sequence of DFs with $M_n(t), n \ge 1$ as the corresponding
sequence of MGFs. If there exists a DF $F$ corresponding to MGF $M$ such that $M_n(t) \to M(t)$
as $n \to \infty$ then $F_n \xrightarrow{w} F$.


3.1 Applications of Continuity theorem 


Example 1: Let $X_n$ be such that $P(X_n = 1) = 1/n^2 = 1 - P(X_n = 2)$. Then
$M_n(t) = E(e^{tX_n}) = \frac{e^t}{n^2} + e^{2t}(1 - 1/n^2) \to e^{2t}$. Clearly the limit is the MGF of a random variable
having degeneracy at 2. Thus we get from the above that $X_n \xrightarrow{D} X$ with $P(X = 2) = 1$.


Example 2: Let $X_n$ be iid $Bin(1, p)$ variables. Then $S_n = \sum_{k=1}^{n} X_k$ has the MGF
$M_n(t) = (1 - p + pe^t)^n$. If we increase $n$ in such a way that $np = \lambda$ is finite then
$(1 - p + pe^t)^n \to e^{\lambda(e^t - 1)}$, which is the MGF of a Poisson random variable with mean $\lambda$. Hence
$S_n \xrightarrow{D}$ Poisson($\lambda$).
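
The convergence of the binomial MGF to the Poisson MGF in Example 2 is easy to inspect numerically. The sketch below fixes $\lambda = np$ and lets $n$ grow; the values $t = 0.5$ and $\lambda = 3$ are arbitrary choices for illustration.

```python
import math

def binomial_mgf(t, n, lam):
    """MGF of S_n ~ Bin(n, p) at t, with p = lam / n so that np = lam."""
    p = lam / n
    return (1.0 - p + p * math.exp(t)) ** n

def poisson_mgf(t, lam):
    """MGF of a Poisson variable with mean lam."""
    return math.exp(lam * (math.exp(t) - 1.0))

t, lam = 0.5, 3.0
for n in (10, 100, 10000, 1000000):
    print(n, binomial_mgf(t, n, lam), poisson_mgf(t, lam))
```

The first column approaches the second from below as $n$ increases, in line with the limit $e^{\lambda(e^t - 1)}$.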


4 Central limit theorems 


WLLN and SLLN provide an idea about the limit of an average of random variables. But
they do not provide the rate at which the limit is reached. Central limit theorems enable us
to get an idea about the rate of convergence to the limiting quantity. Now we shall discuss
convergence of $\frac{S_n - a_n}{b_n}$ to a nondegenerate random variable for suitable choices of centering
and norming constants. We also investigate the properties of such a nondegenerate limit.

The earliest form of CLT was postulated by Abraham de Moivre (1733), who used the normal
approximation to the distribution of the number of heads obtained through a large number
of throws of an unbiased coin. However, the result of de Moivre was referenced by Pierre-Simon
Laplace in his 1812 publication Théorie analytique des probabilités. Laplace extended
the idea of de Moivre to approximate a binomial distribution by the normal distribution.
However, the modern form of CLT is due to Aleksandr Lyapunov (1901), who developed
the general form together with proof. The particular term "central limit theorem" is due
to George Pólya (1920) because of its central role in probability theory. However, Le Cam
interpreted the word "central" because "it describes the behaviour of the centre of the
distribution as opposed to its tails" (Le Cam, 1986). Now we discuss different forms of CLT.


4.1 De Moivre's Limit Theorem 


Let $X_n$ be an iid sequence of Binomial(1, $\mu$) random variables. Then as $n \to \infty$, $\bar{X}_n \xrightarrow{P} \mu$ and
$$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{D} N(0, \mu(1 - \mu)).$$

For a proof one can expand the concerned MGF in a Taylor series and take the limit to get
the MGF of the normal variable as the limit. Then the result is immediate as a consequence
of the continuity theorem. This is the primitive version of CLT (De Moivre, 1733, Laplace, 1810),
which gives the asymptotic distribution only for a sequence of iid Bernoulli variables.


4.2 Lindeberg-Levy CLT 


Let $X_n$ be an iid sequence of random variables with mean $\mu$ and finite variance $\sigma^2$. Then as
$n \to \infty$
$$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{D} N(0, \sigma^2).$$

Thus the limiting distribution of a normalized average is normal provided the observations
are iid with finite variance. The form of the limiting distribution does not depend on the parent
population, which makes it possible to develop asymptotic procedures for a large class of parents
under the finite variance assumption alone.


4.3 Lyapounov condition & CLT 


Lindeberg-Levy CLT considers the sum of iid random variables with finite variance. But
random variables may be independent but not distributed identically. In such a case we
have the Lyapounov CLT, which assumes existence of moments higher than order 2.

Let $X_n$ be an independent sequence of random variables with mean $E(X_k) = \mu_k$ and
$Var(X_k) = \sigma_k^2$. If for some $\delta > 0$, the Lyapounov condition
$$\left(\sum_{k=1}^{n} \sigma_k^2\right)^{-(1 + \delta/2)} \sum_{k=1}^{n} E|X_k - \mu_k|^{2+\delta} \to 0$$
as $n \to \infty$ is satisfied then with $s_n^2 = \sum_{k=1}^{n} \sigma_k^2$,
$$\sum_{k=1}^{n} \frac{X_k - \mu_k}{s_n} \xrightarrow{D} N(0,1).$$


4.4 Lindeberg condition & CLT 


Lindeberg's condition is an alternative to Lyapounov's condition for satisfying CLT. If
Lyapounov's condition is satisfied then Lindeberg's condition is also satisfied but the converse is
not always true.

Let $X_n$ be an independent sequence of random variables with mean $E(X_k) = \mu_k$ and
$Var(X_k) = \sigma_k^2$, and write $s_n^2 = \sum_{k=1}^{n} \sigma_k^2$. If for every $\epsilon > 0$, the Lindeberg condition
$$\left(\sum_{k=1}^{n} \sigma_k^2\right)^{-1} \sum_{k=1}^{n} E\left[(X_k - \mu_k)^2 I(|X_k - \mu_k| > \epsilon s_n)\right] \to 0$$
as $n \to \infty$ is satisfied then
$$\sum_{k=1}^{n} \frac{X_k - \mu_k}{s_n} \xrightarrow{D} N(0,1).$$


5 Uniform convergence in CLT 


CLT only provides the limiting distribution but not the choice of $n$ and the accuracy of the
approximation. However, the choice of an $n$ ensuring that the limit distribution is normal depends on the
underlying distribution. Suppose $G_n(x)$ is the cdf of $\sqrt{n}(\bar{X}_n - \mu)/\sigma$ and $\Phi(x)$ is the DF of a $N(0,1)$
variable. Then CLT states that $G_n(x) \to \Phi(x)$ for all $x$ but the convergence is not uniform
over underlying distributions, so that for fixed $n$ the approximation may perform poorly. The following
result gives a bound on the margin of error $|G_n(x) - \Phi(x)|$ and shows that the convergence is uniform for
the class of distributions with finite third order absolute moments.


Berry-Esseen Bound: Let $X_n$ be an iid sequence of random variables with mean $\mu$, variance
$\sigma^2$ and finite absolute third order moment $\rho = E|X_1 - \mu|^3$. Then for all $n$
$$|G_n(x) - \Phi(x)| \le \sup_x |G_n(x) - \Phi(x)| \le \frac{c\rho}{\sigma^3 \sqrt{n}} \quad \text{for all } x,$$
where the constant $c \in (.4097, .7975)$ depends on the underlying distribution but is independent
of $n$.
The Berry-Esseen theorem gives a sufficient condition for CLT even for a non identical class of
distributions.

Let $X_n \sim F_n$ be an independent sequence of random variables with $E(X_n) = \mu_n$, $Var(X_n) = \sigma_n^2$
such that
$$\frac{E_{F_n}|X_1 - \mu_n|^3}{\sigma_n^3} = o(\sqrt{n}).$$
Then $G_n(x) \to \Phi(x)$ as $n \to \infty$.

For an application, consider an independent sequence of random variables $X_n$ such that
$X_n \sim$ Bernoulli($p_n$). Then $F_n =$ Bernoulli($p_n$) and $E_{F_n}|X_1 - p_n|^3 \le 1$ for any $n$. Thus the
CLT holds as long as $p_n(1 - p_n)$ converges to a non zero quantity. However,
$$\frac{E_{F_n}|X_1 - \mu_n|^3}{\sigma_n^3 \sqrt{n}} = \frac{p_n^2 + (1 - p_n)^2}{\sqrt{n p_n (1 - p_n)}}$$
and hence, even when $p_n \to 0$, the limiting distribution approaches normal if $(np_n)^{-1/2} \to 0$.
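
For Bernoulli observations the Berry-Esseen bound can be compared with the exact error, because $G_n$ is then a step function whose largest deviation from $\Phi$ occurs at a jump point. The sketch below is an illustration under the assumption $c = 0.7975$ (the upper end of the interval quoted above); the function names are hypothetical.

```python
import math

def std_normal_cdf(x):
    """Standard normal cdf Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sup_distance_bernoulli(n, p):
    """Exact sup_x |G_n(x) - Phi(x)| for the standardised sum of n iid Bernoulli(p)."""
    q = 1.0 - p
    sd = math.sqrt(n * p * q)
    worst, cdf_left = 0.0, 0.0
    for k in range(n + 1):
        x = (k - n * p) / sd                      # jump point of G_n
        phi = std_normal_cdf(x)
        cdf_right = cdf_left + math.comb(n, k) * p**k * q ** (n - k)
        # check both the left limit and the value of G_n at the jump
        worst = max(worst, abs(cdf_left - phi), abs(cdf_right - phi))
        cdf_left = cdf_right
    return worst

def berry_esseen_bound(n, p, c=0.7975):
    """c * rho / (sigma^3 sqrt(n)) with rho = E|X - p|^3 = pq(p^2 + q^2)."""
    q = 1.0 - p
    rho = p * q * (p * p + q * q)
    sigma = math.sqrt(p * q)
    return c * rho / (sigma**3 * math.sqrt(n))

for n, p in [(10, 0.5), (100, 0.1), (400, 0.3)]:
    print(n, p, sup_distance_bernoulli(n, p), berry_esseen_bound(n, p))
```

In each case the exact error lies below the bound, and both shrink at the rate $n^{-1/2}$.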


Large sample Inference: Module 16: 


What we provide in this module 


• Asymptotic distribution of kurtosis coefficient
• Asymptotic joint distribution of skewness & kurtosis coefficient
• Asymptotic distribution of sample correlation coefficient
• Rate of convergence




1 Further applications of delta theorem 


Example 1: Distribution of sample kurtosis coefficient

Suppose $X_n, n \ge 1$ are iid random variables with finite eighth order moment $\mu_8$. If we denote
the r th order central moment by $\mu_r$ (and its sample counterpart by $m_r$) then the population
kurtosis coefficient is $\gamma_2 = \frac{\mu_4}{\mu_2^2} - 3$ provided $\mu_2 > 0$. As earlier, $m_r \xrightarrow{P} \mu_r$ for any $r$, and hence
a consistent estimate of $\gamma_2$ is $g_2 = \frac{m_4}{m_2^2} - 3$ provided $m_2 > 0$. Then $g_2$ is a function of two sample
central moments, namely, $m_2$ and $m_4$. Now, to derive the asymptotic distribution, we define
$g(u, v) = \frac{v}{u^2} - 3$, which is continuous and has continuous derivatives except for $u = 0$. Now from
the joint asymptotic distribution of the sample central moments, we have
$\sqrt{n}(m_2 - \mu_2, m_4 - \mu_4)^T \xrightarrow{D} N_2(0, \Sigma)$ with
$$\Sigma = \begin{pmatrix} \sigma_{22} & \sigma_{24} \\ \sigma_{24} & \sigma_{44} \end{pmatrix}.$$

Now a straightforward algebra gives
$$\nabla g(\mu_2, \mu_4) = \left(-\frac{2\mu_4}{\mu_2^3}, \frac{1}{\mu_2^2}\right)^T$$
and hence we obtain $\sqrt{n}(g_2 - \gamma_2) \xrightarrow{D} N(0, \tau^2)$, where
$$\tau^2 = (\nabla g)^T \Sigma \nabla g = (\gamma_2 + 3)^2 h(\mu_2, \mu_3, \mu_4, \mu_5, \mu_6, \mu_7, \mu_8)$$
for some function $h$. We refer the interested reader to the book by Serfling (1980) for an
explicit expression of $h$. However, if the underlying distribution is standard normal (due to
location and scale invariance, one can WLG consider this distribution), then $\mu_3 = 0, \mu_5 = 0,
\mu_7 = 0, \mu_4 = 3, \mu_6 = 15, \mu_8 = 105$ and hence we obtain $\sqrt{n}(g_2 - \gamma_2) \xrightarrow{D} N(0, 24)$.


Example 2: Joint distribution of sample skewness & kurtosis coefficients
Suppose $X_n, n \ge 1$ are iid random variables with finite eighth order moment $\mu_8$. Then
with the already used notations, the sample skewness and kurtosis coefficients are $g_1 = \frac{m_3}{m_2^{3/2}}$
and $g_2 = \frac{m_4}{m_2^2} - 3$, respectively, provided $m_2 > 0$. Now we shall obtain the joint asymptotic
distribution of such coefficients. We define a vector valued function
$$g(u_1, u_2, u_3) = \begin{pmatrix} g_1(u_1, u_2, u_3) \\ g_2(u_1, u_2, u_3) \end{pmatrix},$$
where $g_1(u_1, u_2, u_3) = u_2/u_1^{3/2}$ and $g_2(u_1, u_2, u_3) = u_3/u_1^2 - 3$. As indicated
earlier, each $g_i$ is continuous and has continuous derivatives except for $u_1 = 0$. Now
from the joint asymptotic distribution of the sample central moments, we have
$\sqrt{n}(m_2 - \mu_2, m_3 - \mu_3, m_4 - \mu_4)^T \xrightarrow{D} N_3(0, \Sigma)$ with
$$\Sigma^{3 \times 3} = \begin{pmatrix} \sigma_{22} & \sigma_{23} & \sigma_{24} \\ \sigma_{23} & \sigma_{33} & \sigma_{34} \\ \sigma_{24} & \sigma_{34} & \sigma_{44} \end{pmatrix}.$$
Now we define the matrix $G^{2 \times 3} = \left(\left(\frac{\partial g_i}{\partial u_j}\right)\right)$ evaluated at $(\mu_2, \mu_3, \mu_4)$. A straightforward algebra gives
$$G = \begin{pmatrix} -\frac{3}{2} \mu_3 \mu_2^{-5/2} & \mu_2^{-3/2} & 0 \\ -2\mu_4 \mu_2^{-3} & 0 & \mu_2^{-2} \end{pmatrix}.$$

Hence we obtain $\sqrt{n}(g_1 - \gamma_1, g_2 - \gamma_2)^T \xrightarrow{D} N(0, V)$, where $V = G \Sigma G^T$. The exact expression
of $V$ requires a huge calculation and hence we skip this. However, if the underlying distribution
is standard normal (due to location and scale invariance, one can WLG consider this
distribution), then $\mu_3 = 0, \mu_5 = 0, \mu_7 = 0, \mu_4 = 3, \mu_6 = 15, \mu_8 = 105$ and hence we obtain
$$\Sigma = \begin{pmatrix} 2 & 0 & 12 \\ 0 & 6 & 0 \\ 12 & 0 & 96 \end{pmatrix}.$$
Again we obtain
$$G = \begin{pmatrix} 0 & 1 & 0 \\ -6 & 0 & 1 \end{pmatrix}.$$

Thus we obtain that
$$\sqrt{n}(g_1 - \gamma_1, g_2 - \gamma_2)^T \xrightarrow{D} N(0, Diag(6, 24)).$$

Thus we find that the sample skewness and kurtosis coefficients are asymptotically independent
if the underlying distribution is normal. However, one can check that if the underlying
distribution is symmetric about the origin with finite even order moments up to and including
the order 8, independence holds.
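
The matrix product $G \Sigma G^T$ for the standard normal case can be carried out explicitly. The sketch below builds $\Sigma$ from the standard large sample covariance formula for sample central moments, $\sigma_{jk} = \mu_{j+k} - \mu_j \mu_k - j\mu_{j-1}\mu_{k+1} - k\mu_{j+1}\mu_{k-1} + jk\mu_{j-1}\mu_{k-1}\mu_2$; this formula is an assumption brought in for the illustration and is not stated in these notes. With it, $\sigma_{44} = \mu_8 - \mu_4^2 = 96$.

```python
def mat_mul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

# Central moments mu_0, ..., mu_8 of the standard normal distribution.
mu = [1, 0, 1, 0, 3, 0, 15, 0, 105]

def sigma_entry(j, k):
    """Asymptotic covariance of sqrt(n) m_j and sqrt(n) m_k (assumed standard formula)."""
    return (mu[j + k] - mu[j] * mu[k] - j * mu[j - 1] * mu[k + 1]
            - k * mu[j + 1] * mu[k - 1] + j * k * mu[j - 1] * mu[k - 1] * mu[2])

orders = (2, 3, 4)
sigma = [[sigma_entry(j, k) for k in orders] for j in orders]
G = [[0, 1, 0], [-6, 0, 1]]            # gradient matrix at (mu2, mu3, mu4) = (1, 0, 3)
V = mat_mul(mat_mul(G, sigma), transpose(G))
print(sigma)
print(V)
```

The off-diagonal entries of $V$ vanish, which is the asymptotic independence of the two coefficients noted above.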


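Example 3: Distribution of sample correlation coefficients

Suppose $(X_i, Y_i), i \ge 1$ are iid random vectors with $E(X_1^4) < \infty$ and $E(Y_1^4) < \infty$. Define
the correlation coefficient based on $n$ pairs of observations by $r_n = \frac{s_{xy}}{\sqrt{s_{xx} s_{yy}}}$, where
$s_{xy} = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$. Naturally, $r_n$ is invariant to location transformation and
hence WLG, we can assume $E(X_i) = E(Y_i) = 0 \ \forall i$. For nonnegative integers $r$ and $s$ define
$\mu_{rs} = E(X_1^r Y_1^s)$ and $m_{rs} = \frac{1}{n} \sum_{i=1}^{n} X_i^r Y_i^s$. Then $s_{xy} = m_{11} - m_{10} m_{01}$, $s_{xx} = m_{20} - m_{10}^2$
and $s_{yy} = m_{02} - m_{01}^2$. Thus we observe that $r_n$ is a function of five moments, namely,
$m_{11}, m_{20}, m_{02}, m_{01}, m_{10}$. Now it follows from CLT that
$\sqrt{n}(m_{10} - \mu_{10}, m_{01} - \mu_{01}, m_{20} - \mu_{20}, m_{02} - \mu_{02}, m_{11} - \mu_{11})^T \xrightarrow{D} N_5(0, \Sigma)$
with $\Sigma = Disp(X_1, Y_1, X_1^2, Y_1^2, X_1 Y_1)$. Naturally, $\mu_{01} = \mu_{10} = 0$.

Define the vector valued function
$$g(u_1, u_2, u_3, u_4, u_5) = \begin{pmatrix} u_3 - u_1^2 \\ u_4 - u_2^2 \\ u_5 - u_1 u_2 \end{pmatrix}.$$
Thus with the already introduced notations, we have $G^{3 \times 5} = \left(\left(\frac{\partial g_i}{\partial u_j}\right)\right)$ evaluated at
$(\mu_{10}, \mu_{01}, \mu_{20}, \mu_{02}, \mu_{11})$. Now a straightforward algebra gives
$$G = \begin{pmatrix} 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}.$$

Hence applying the delta theorem, we get $\sqrt{n}(s_{xx} - \mu_{20}, s_{yy} - \mu_{02}, s_{xy} - \mu_{11})^T \xrightarrow{D} N_3(0, G \Sigma G^T)$.
A simple matrix multiplication yields $G \Sigma G^T = Disp(X_1^2, Y_1^2, X_1 Y_1)$. Thus we have obtained
the joint asymptotic distribution of $s_{xx}$, $s_{yy}$ and $s_{xy}$.

Finally, define the univariate function $h(u_1, u_2, u_3) = \frac{u_3}{\sqrt{u_1 u_2}}$. Then a simple algebra gives
$$\nabla h(\mu_{20}, \mu_{02}, \mu_{11}) = \left(-\frac{\rho}{2\mu_{20}}, -\frac{\rho}{2\mu_{02}}, \frac{1}{\sqrt{\mu_{20} \mu_{02}}}\right)^T,$$
where $\rho = \frac{\mu_{11}}{\sqrt{\mu_{20} \mu_{02}}}$. Thus by the delta theorem, we get $\sqrt{n}(r_n - \rho) \xrightarrow{D} N(0, \sigma^2)$, where
$\sigma^2 = (\nabla h(\mu_{20}, \mu_{02}, \mu_{11}))^T G \Sigma G^T (\nabla h(\mu_{20}, \mu_{02}, \mu_{11}))$. To obtain the exact expression of $\sigma^2$, we need
to perform the above matrix multiplication and hence we omit the details.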
Naturally, the large sample variance depends on the unknown moments of the underlying distribution. However, the expression obtained when the underlying distribution is bivariate normal is of special interest. Thus we assume that $(X_1, Y_1) \sim N_2(0, 0, 1, 1, \rho)$. First of all we derive the expression of $Disp(X_1^2, Y_1^2, X_1Y_1)$ under the bivariate normal assumption. It is easy to observe that $Var(X_1^2) = Var(Y_1^2) = 2$. Now $Cov(X_1^2, Y_1^2) = E(X_1^2Y_1^2) - 1$. Applying the fact that $Y_1|X_1 \sim N(\rho X_1, 1 - \rho^2)$, we obtain $E(X_1^2Y_1^2) = E(X_1^2(1 - \rho^2 + \rho^2X_1^2)) = 1 + 2\rho^2$ and consequently $Cov(X_1^2, Y_1^2) = 2\rho^2$. Applying similar arguments and the fact that the distribution of $(X_1, Y_1)$ is exchangeable, one can obtain $Cov(X_1^2, X_1Y_1) = 2\rho = Cov(Y_1^2, X_1Y_1)$. Finally $Var(X_1Y_1) = E(X_1^2Y_1^2) - \rho^2 = 1 + \rho^2$.

Thus we get

$$Disp(X_1^2, Y_1^2, X_1Y_1) = \begin{pmatrix} 2 & 2\rho^2 & 2\rho \\ 2\rho^2 & 2 & 2\rho \\ 2\rho & 2\rho & 1 + \rho^2 \end{pmatrix}.$$

Again for a bivariate normal parent we get

$$\nabla h(\mu_{20}, \mu_{02}, \mu_{11}) = \left(-\frac{\rho}{2},\ -\frac{\rho}{2},\ 1\right)^T.$$

A simple matrix multiplication finally gives $\sqrt{n}(r_n - \rho) \xrightarrow{D} N(0, (1 - \rho^2)^2)$.
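The matrix multiplication mentioned above can be checked numerically. The following is a minimal sketch (the function name `asymptotic_variance_of_r` is ours, introduced only for illustration) that evaluates the quadratic form for a few values of the correlation coefficient and prints it next to $(1-\rho^2)^2$.

```python
def asymptotic_variance_of_r(rho):
    """Evaluate grad^T D grad for the sample correlation under N2(0, 0, 1, 1, rho)."""
    # Dispersion matrix of (X^2, Y^2, XY) for a standard bivariate normal pair.
    disp = [
        [2.0, 2.0 * rho**2, 2.0 * rho],
        [2.0 * rho**2, 2.0, 2.0 * rho],
        [2.0 * rho, 2.0 * rho, 1.0 + rho**2],
    ]
    # Gradient of h(u1, u2, u3) = u3 / sqrt(u1 * u2) at (1, 1, rho).
    grad = [-rho / 2.0, -rho / 2.0, 1.0]
    return sum(grad[i] * disp[i][j] * grad[j] for i in range(3) for j in range(3))


for rho in (0.0, 0.3, 0.5, -0.8):
    # The two printed numbers should agree up to rounding error.
    print(rho, asymptotic_variance_of_r(rho), (1.0 - rho**2) ** 2)
```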


After getting the asymptotic distribution, it is natural to study the rate of convergence. In other words we need an idea about the value of the sample size ensuring asymptotic normality. For numerical evaluation, we assume a bivariate normal parent with correlation coefficient $\rho = 0$. For $\rho = 0$, it is well known that $r\sqrt{n-2}/\sqrt{1-r^2}$ has a t distribution with $n-2$ degrees of freedom. Thus it is natural to compare the asymptotic distribution with the actual one. Hence we provide the normal QQ plots of the variables $T_{1n} = \sqrt{n}(r - \rho)/(1 - \rho^2)$ and $T_{2n} = r\sqrt{n-2}/\sqrt{1-r^2}$ for $n = 15, 20$ based on a simulation study. The plots can be found in the next page.
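A simulation of this kind could be set up along the following lines. This is only a sketch under our own assumptions (the helper names, the seed and the number of replications are illustrative and not taken from the original study); it generates the two variables so that their quantiles can be compared with those of the standard normal.

```python
import math
import random


def sample_correlation(xs, ys):
    """Product moment correlation coefficient of paired observations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_t1_t2(n, rho, reps, seed=0):
    """Return simulated values of T1n and T2n for a bivariate normal parent."""
    rng = random.Random(seed)
    t1, t2 = [], []
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Y = rho * X + sqrt(1 - rho^2) * Z has correlation rho with X.
        ys = [rho * x + math.sqrt(1.0 - rho**2) * rng.gauss(0.0, 1.0) for x in xs]
        r = sample_correlation(xs, ys)
        t1.append(math.sqrt(n) * (r - rho) / (1.0 - rho**2))
        # T2n has an exact t distribution only when rho = 0.
        t2.append(r * math.sqrt(n - 2) / math.sqrt(1.0 - r**2))
    return t1, t2


t1, t2 = simulate_t1_t2(n=15, rho=0.0, reps=1000, seed=1)
print(sorted(t1)[:3], sorted(t2)[:3])  # smallest simulated values of each variable
```

Sorting the simulated values and plotting them against standard normal quantiles gives the QQ plots described in the text.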


Looking at the figures, we find that asymptotic normality holds approximately even for n = 15 or 20. Thus inferential procedures for n = 15 or 20 observations can be based on this approximation, and one can be assured that the loss will be very small. However, it is interesting to observe that the second variable, whose exact distribution is t, is also very close to a standard normal for those sample sizes. But the results obtained for $\rho = 0$ can not be stated in a generalised way.

[Figure 1: Normal Q-Q plots (sample quantiles against theoretical quantiles) with n = 15 for $T_1$ and $T_2$ (left to right)]

[Figure 2: Normal Q-Q plots (sample quantiles against theoretical quantiles) with n = 20 for $T_1$ and $T_2$ (left to right)]


Next, for numerical evaluation with a nonzero correlation, we assume a bivariate normal parent with correlation coefficient $\rho = 0.5$, generate $T_{1n} = \sqrt{n}(r - \rho)/(1 - \rho^2)$ for different values of n and construct the normal QQ plot. The plots can be found in the next page.

Looking at the nature of the QQ plots, we find that as we increase n, the convergence becomes better. That is, the distribution of $T_{1n}$ becomes closer to the standard normal distribution. For n = 120, we find that the approximation is good enough. However, we have used the theoretical value of $\rho$ and hence the asymptotic variance to generate the QQ plot of $T_{1n}$. But in practice, the asymptotic variance is not known and hence has to be replaced by a suitable estimate. Such replacement, of course, makes the convergence rate slower.


We have just discussed the convergence to normality for a bivariate normal parent. But distributions can be non normal also. So, we consider a particular type of bivariate distribution, namely, McKay's bivariate gamma, defined by the density

$$f(x, y) \propto x^{a-1} e^{-cy} (y - x)^{b-1}, \quad 0 < x < y < \infty.$$

For such a distribution, the correlation coefficient is simply $\sqrt{a/(a+b)} > 0$. Thus to study the convergence rate for a non normal parent, we perform a simulation study with $a = 1$, $b = 3$ and $c = 2$ for different values of n. In particular, we prepare histograms of $\sqrt{n}(r - .5)$ for different choices of n based on samples from the bivariate gamma distribution. All these can be found in the next page.
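Samples from this bivariate gamma distribution can be generated from two independent gamma variables. The sketch below relies on the standard representation of McKay's distribution, namely X ~ Gamma(shape a, rate c) and Y = X + W with W ~ Gamma(shape b, rate c) independent of X; this representation and the helper names are our assumptions, not part of the original study.

```python
import math
import random


def mckay_gamma_sample(n, a, b, c, rng):
    """Draw n pairs (X, Y) with X ~ Gamma(a, rate c), Y = X + W, W ~ Gamma(b, rate c)."""
    pairs = []
    for _ in range(n):
        x = rng.gammavariate(a, 1.0 / c)  # second argument is the scale, i.e. 1 / rate
        w = rng.gammavariate(b, 1.0 / c)
        pairs.append((x, x + w))
    return pairs


def correlation(pairs):
    """Sample correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(2024)
n = 200
# One simulated value of sqrt(n) * (r - 0.5) for a = 1, b = 3, c = 2.
print(math.sqrt(n) * (correlation(mckay_gamma_sample(n, 1.0, 3.0, 2.0, rng)) - 0.5))
```

Repeating the last step many times and collecting the values gives the raw material for a histogram of $\sqrt{n}(r - .5)$.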


We have chosen the parameters in such a way that $\rho = .5$ is satisfied. We have already studied such a behaviour for a bivariate normal parent. However, for a bivariate gamma parent, the asymptotic variance calculation requires knowledge of the higher order moments and hence, for brevity, we provide only the histograms. The histogram for n = 100 shows a considerable amount of skewness. However, the extent of skewness diminishes as we increase n from 100 to 200. For n = 200, we arrive at the histogram of a nearly symmetric distribution. Such a phenomenon was observed for the bivariate normal parent with smaller sample sizes. Thus, one can conclude that a non normal parent makes the rate of convergence slower.

[Figure 3: Normal Q-Q plots (sample quantiles against theoretical quantiles) showing convergence to normality with $\rho = .5$ for n = 60, 90, 120 (clockwise)]

[Figure 4: Histograms (density against x) showing convergence to normality for n = 100, 150, 200 (clockwise)]

[Figure 5: Histograms (density against x) showing convergence to normality for ($p_1 = .1$, $p_2 = .2$), n = 40, 50 (left to right)]


We have discussed so far convergence considering a continuous bivariate parent. But the underlying distribution can be discrete as well. So to know the rate of convergence for discrete parents, we consider the Multinomial$(n, p_1, p_2)$ distribution and prepare the histogram of the quantity $\sqrt{n}(r - \rho)$, where $\rho = -\sqrt{\frac{p_1p_2}{(1-p_1)(1-p_2)}}$ is the theoretical correlation coefficient, for known values of $p_1, p_2$ and increasing values of n. The values of $p_1, p_2$ are chosen in a way to reflect high, moderate and low correlation coefficients. For each value of $(p_1, p_2)$ we simulate the value of the above quantity taking n as 40 and 80. All these can be found in the next few slides.


[Figure 6: Histograms (density against x) showing convergence to normality for ($p_1 = .2$, $p_2 = .3$), n = 40, 50 (left to right)]

[Figure 7: Histograms (density against x) showing convergence to normality for ($p_1 = .3$, $p_2 = .4$), n = 40, 50 (left to right)]

[Figure 8: Histograms (density against x) showing convergence to normality for ($p_1 = .4$, $p_2 = .5$), n = 40, 50 (left to right)]


We have fixed the parameters in such a way that $\rho$ takes the values -.17, -.33, -.53 and -.82. As expected, with the increase of n, the histogram more closely represents a symmetric distribution. However, the height of the histogram increases with the increase in the magnitude of the theoretical correlation coefficient. Naturally, the spread reduces with such an increase. Thus, it is easy to say that the variability of the asymptotic distribution decreases as the magnitude of the correlation coefficient increases. Thus, for a discrete bivariate parent, the variability of the asymptotic distribution is expected to depend on the theoretical correlation coefficient.
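The four correlation values quoted above follow from the usual formula for the correlation between two cell counts of a multinomial distribution. A minimal sketch (the function name is ours) that reproduces them for the four parameter pairs used in the figures:

```python
import math


def multinomial_correlation(p1, p2):
    """Correlation between two cell counts of a multinomial with cell probabilities p1, p2."""
    return -math.sqrt(p1 * p2 / ((1.0 - p1) * (1.0 - p2)))


pairs = [(0.1, 0.2), (0.2, 0.3), (0.3, 0.4), (0.4, 0.5)]
rhos = [round(multinomial_correlation(p1, p2), 2) for p1, p2 in pairs]
print(rhos)  # [-0.17, -0.33, -0.53, -0.82]
```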




Large sample Inference: Module 17¹


What we provide in this module 


• Asymptotic distribution of sample quantiles
• Asymptotic distributions of quantile based measures
• Asymptotic distributions of extreme sample quantiles


1 Co-Ordinator: Dr. Rahul Bhattacharya, Department of Statistics, Calcutta University 


1 Asymptotic distribution of sample quantiles 


Let $X_i, i = 1, 2, .., n$ be iid observations from a continuous cdf $F(x)$. Then all the observations are distinct with probability one. Suppose $X_{(1)} < X_{(2)} < .. < X_{(n)}$ denotes the full set of order statistics and in particular $X_{(r)}$ is the $r$th order statistic. For $p \in (0,1)$, the $p$th order population quantile is defined as $\xi_p = F^{-1}(p)$ and the corresponding sample quantile is defined as $X_{(k)}$, where $k = [np + 1]$. We shall obtain the joint asymptotic distribution of $X_{(r)}$ and $X_{(s)}$. Note that order statistics are not explicit functions of sample averages and hence the delta theorem is not directly applicable.


It is well known that if $X \sim F$ and $F$ is continuous then $F^{-1}(U) \stackrel{D}{=} X$ for $U \sim R(0,1)$. Thus for a full set of n observations, we find $U_{(j)} \stackrel{D}{=} F(X_{(j)}), j = 1, 2, .., n$, where $U_{(j)}, j = 1, 2, .., n$ is the full set of order statistics from a R(0,1) population. So we first find the asymptotic joint distribution of order statistics from a R(0,1) population and then use the delta method to obtain the joint asymptotic distribution of the general order statistics. Now we provide a few auxiliary results from distribution theory.


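Lemma 1: If $Y_i \stackrel{iid}{\sim} Exp(mean = 1), i = 1, 2, .., n+1$, then

$$\left(\frac{Y_1}{\sum_{i=1}^{n+1}Y_i},\ \frac{Y_1 + Y_2}{\sum_{i=1}^{n+1}Y_i},\ ..,\ \frac{Y_1 + Y_2 + .. + Y_n}{\sum_{i=1}^{n+1}Y_i}\right) \stackrel{D}{=} (U_{(1)}, U_{(2)}, .., U_{(n)}),$$

where $U_{(j)}, j = 1, 2, .., n$ is the full set of order statistics from a R(0,1) population.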


Proof: First of all, note that the joint pdf of $Y_i, i = 1, 2, .., n+1$ is

$$f(y_1, y_2, .., y_{n+1}) = \exp\left(-\sum_{i=1}^{n+1} y_i\right), \quad 0 < y_i < \infty,\ i = 1, 2, .., n+1.$$

Let us transform $Z_j = Y_1 + Y_2 + .. + Y_j, j = 1, 2, .., n+1$ so that we get

$$y_1 = z_1, \qquad y_j = z_j - z_{j-1},\ j = 2, 3, .., n+1.$$

Then we naturally get the Jacobian matrix

$$J(z) = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 \\ -1 & 1 & 0 & \cdots & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & -1 & 1 \end{pmatrix}.$$

Now a simple expansion through the co-factors of the first row in a recursive way gives $|J(z)| = 1$. Since $Y_i > 0$ for all $i$, we have $0 < Z_1 < Z_2 < .. < Z_{n+1} < \infty$ and hence we get the joint distribution of $(Z_1, Z_2, .., Z_{n+1})$ as

$$f(z_1, z_2, .., z_{n+1}) = \exp(-z_{n+1})\, I(0 < z_1 < z_2 < .. < z_{n+1} < \infty).$$

Next we apply the transformation

$$V_j = \frac{Z_j}{Z_{n+1}},\ j = 1, 2, .., n, \qquad V_{n+1} = Z_{n+1},$$

so that we get $Z_j = V_jV_{n+1}, j = 1, 2, .., n$ and $Z_{n+1} = V_{n+1}$. Thus we naturally get the Jacobian matrix

$$J(v) = \begin{pmatrix} v_{n+1} & 0 & \cdots & 0 & v_1 \\ 0 & v_{n+1} & \cdots & 0 & v_2 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & v_{n+1} & v_n \\ 0 & 0 & \cdots & 0 & 1 \end{pmatrix}.$$

Now expanding through the co-factors of the last row, we get $|J(v)| = v_{n+1}^n$. Hence we get the joint distribution of $V_j, j = 1, 2, .., n+1$ as

$$f(v_1, v_2, .., v_{n+1}) = n!\, I(0 < v_1 < v_2 < .. < v_n < 1) \cdot \frac{\exp(-v_{n+1})\, v_{n+1}^n}{\Gamma(n+1)}\, I(v_{n+1} > 0).$$

Hence $(V_1, .., V_n)$ is independent of $V_{n+1}$, where the joint distribution of $(V_1, .., V_n)$ is simply that of n order statistics from a R(0,1) population. However, $V_{n+1} \sim Gamma(rate = 1, shape = n+1)$ and this completes the proof.
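The representation in Lemma 1 is easy to examine by simulation. The sketch below (function name, seed and number of replications are illustrative assumptions) builds the normalised partial sums of exponential variables and compares their averages with $j/(n+1)$, the known mean of the $j$th uniform order statistic.

```python
import random


def uniform_order_stats_via_exponentials(n, rng):
    """Normalised partial sums of n + 1 iid Exp(1) variables, as in Lemma 1."""
    ys = [rng.expovariate(1.0) for _ in range(n + 1)]
    total = sum(ys)
    partial, out = 0.0, []
    for y in ys[:n]:
        partial += y
        out.append(partial / total)
    return out


rng = random.Random(7)
n, reps = 9, 20000
sums = [0.0] * n
for _ in range(reps):
    for j, u in enumerate(uniform_order_stats_via_exponentials(n, rng)):
        sums[j] += u
means = [s / reps for s in sums]
# Each simulated mean should be close to j / (n + 1), j = 1, .., n.
print([round(m, 3) for m in means])
```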




Observation: From the above result, it is evident that

$$\left(\frac{\sum_{i=1}^{r}Y_i}{\sum_{i=1}^{n+1}Y_i},\ \frac{\sum_{i=1}^{s}Y_i}{\sum_{i=1}^{n+1}Y_i}\right) \stackrel{D}{=} (U_{(r)}, U_{(s)})$$

for $r < s$. Thus we have expressed the joint distribution of the order statistics $(U_{(r)}, U_{(s)})$ as that of a function of sums of iid random variables. This equivalence enables us to apply the CLT on the equivalent sets of random variables.


Lemma 2: If $\sqrt{n}(r/n - p_1) \to 0$ and $\sqrt{n}(s/n - p_2) \to 0$ with $0 < p_1 < p_2 < 1$ then

$$\sqrt{n}\begin{pmatrix} \frac{T_r}{n} - p_1 \\ \frac{T_s - T_r}{n} - (p_2 - p_1) \\ \frac{T_{n+1} - T_s}{n} - (1 - p_2) \end{pmatrix} \xrightarrow{D} N_3(0, Diag(p_1, p_2 - p_1, 1 - p_2)),$$

where $T_j = \sum_{i=1}^{j} Y_i$.

Proof: Consider the representation $\sqrt{n}\left(\frac{T_r}{n} - p_1\right) = \sqrt{n}\left(\frac{T_r}{n} - \frac{r}{n}\right) + \sqrt{n}\left(\frac{r}{n} - p_1\right)$. By the CLT, the first term on the RHS converges in distribution to $N(0, p_1)$, whereas the second quantity is nonrandom and converges to 0 due to the assumption that $\sqrt{n}(r/n - p_1) \to 0$. Hence by Slutsky's theorem $\sqrt{n}\left(\frac{T_r}{n} - p_1\right) \xrightarrow{D} N(0, p_1)$.

In a similar way the other convergences can also be proved using the CLT for iid random variables in combination with the facts that $\sqrt{n}(r/n - p_1) \to 0$ and $\sqrt{n}(s/n - p_2) \to 0$. Moreover, the three components are exactly independent because they are based on disjoint sets of iid random variables. Hence the result follows.


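Lemma 3: If $\sqrt{n}(r/n - p_1) \to 0$ and $\sqrt{n}(s/n - p_2) \to 0$ with $0 < p_1 < p_2 < 1$ then

$$\sqrt{n}\begin{pmatrix} U_{(r)} - p_1 \\ U_{(s)} - p_2 \end{pmatrix} \xrightarrow{D} N_2\left(0, \begin{pmatrix} p_1(1 - p_1) & p_1(1 - p_2) \\ p_1(1 - p_2) & p_2(1 - p_2) \end{pmatrix}\right).$$

Proof: First of all note that from Lemma 1,

$$\begin{pmatrix} U_{(r)} \\ U_{(s)} \end{pmatrix} \stackrel{D}{=} \begin{pmatrix} \frac{T_r}{T_{n+1}} \\ \frac{T_s}{T_{n+1}} \end{pmatrix} = \begin{pmatrix} \frac{T_r}{T_r + (T_s - T_r) + (T_{n+1} - T_s)} \\ \frac{T_r + (T_s - T_r)}{T_r + (T_s - T_r) + (T_{n+1} - T_s)} \end{pmatrix}.$$

Thus we define the vector valued function $g(u_1, u_2, u_3) = \left(\frac{u_1}{u_1 + u_2 + u_3},\ \frac{u_1 + u_2}{u_1 + u_2 + u_3}\right)^T$. Then it is easy to observe that

$$g\left(\frac{T_r}{n},\ \frac{T_s - T_r}{n},\ \frac{T_{n+1} - T_s}{n}\right) = \begin{pmatrix} \frac{T_r}{T_{n+1}} \\ \frac{T_s}{T_{n+1}} \end{pmatrix} \stackrel{D}{=} \begin{pmatrix} U_{(r)} \\ U_{(s)} \end{pmatrix}.$$

Hence with the already introduced notations, evaluating the partial derivatives at $(p_1, p_2 - p_1, 1 - p_2)$, we get

$$G = \begin{pmatrix} 1 - p_1 & -p_1 & -p_1 \\ 1 - p_2 & 1 - p_2 & -p_2 \end{pmatrix}.$$

Thus applying the delta theorem we have $\sqrt{n}(U_{(r)} - p_1, U_{(s)} - p_2)^T \xrightarrow{D} N_2(0, G\, Diag(p_1, p_2 - p_1, 1 - p_2)\, G^T)$. A simple matrix multiplication gives

$$G\, Diag(p_1, p_2 - p_1, 1 - p_2)\, G^T = \begin{pmatrix} p_1(1 - p_1) & p_1(1 - p_2) \\ p_1(1 - p_2) & p_2(1 - p_2) \end{pmatrix}$$

and hence the result follows.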
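A generalization of Lemma 3: If $\sqrt{n}(r_i/n - p_i) \to 0, i = 1, 2, .., k$ with $0 < p_1 < p_2 < .. < p_k < 1$ then

$$\sqrt{n}\,(U_{(r_1)} - p_1,\ U_{(r_2)} - p_2,\ ..,\ U_{(r_k)} - p_k)^T \xrightarrow{D} N_k(0, p(1 - p)^T),$$

where $p = (p_1, p_2, .., p_k)^T$ and $p(1 - p)^T$ is used as shorthand for the symmetric matrix whose $(i, j)$th element is $p_i(1 - p_j)$ for $i \le j$.

Now we are in a position to derive our original result, that is, the asymptotic distribution of sample quantiles.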

Asymptotic distribution of order statistics: If $X_{(j)}, j = 1, 2, .., n$ are order statistics from a continuous DF $F$ having continuous density $F'(x) = f(x)$ such that $f(x)$ is positive in a neighbourhood of the population quantiles $\xi_{p_i}, i = 1, 2, .., k$ defined by $F(\xi_{p_i}) = p_i, i = 1, 2, .., k$, then

$$\sqrt{n}\,(X_{(r_1)} - \xi_{p_1},\ X_{(r_2)} - \xi_{p_2},\ ..,\ X_{(r_k)} - \xi_{p_k})^T \xrightarrow{D} N_k(0, \Sigma),$$

where $\Sigma = ((\sigma_{ij}))$ with $\sigma_{ij} = \frac{p_i(1 - p_j)}{f(\xi_{p_i}) f(\xi_{p_j})}$ for $i \le j$.

Proof: The proof of the above result follows from a simple application of the delta method. Since, by the assumptions of the result, $F$ is continuous and hence $F^{-1}(u)$ is well defined, we define the vector valued function $g(u_1, u_2, .., u_k) = (F^{-1}(u_1), F^{-1}(u_2), .., F^{-1}(u_k))^T$. Then with the already introduced notations, $G$, the matrix with elements as differentials, is a diagonal matrix with elements $\frac{\partial F^{-1}(u_i)}{\partial u_i}\Big|_{u_i = p_i} = (f(F^{-1}(p_i)))^{-1}$. Hence

$$\sqrt{n}\,(X_{(r_1)} - \xi_{p_1},\ X_{(r_2)} - \xi_{p_2},\ ..,\ X_{(r_k)} - \xi_{p_k})^T \xrightarrow{D} N_k(0, G\,p(1 - p)^T G^T).$$

A simple matrix multiplication yields the final result.


Asymptotic marginal distribution: With the already introduced notations and assumptions, for $r = [np + 1]$, $\sqrt{n}(X_{(r)} - \xi_p) \xrightarrow{D} N(0, \sigma^2)$, where $\sigma^2 = p(1 - p)/f^2(\xi_p)$.
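As an illustration of the marginal result, the asymptotic variance $p(1-p)/f^2(\xi_p)$ can be evaluated for a standard normal parent. The sketch below uses `statistics.NormalDist` from the Python standard library; the choice of a normal parent and the function name are our assumptions for the example.

```python
import math
from statistics import NormalDist


def quantile_asymptotic_variance(p, dist=NormalDist()):
    """Asymptotic variance p(1 - p) / f(xi_p)^2 of sqrt(n) * (sample quantile - xi_p)."""
    xi_p = dist.inv_cdf(p)  # population quantile of order p
    return p * (1.0 - p) / dist.pdf(xi_p) ** 2


# For the median of a standard normal parent the value is pi / 2.
print(quantile_asymptotic_variance(0.5), math.pi / 2.0)
```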


2 Asymptotic distribution of quantile based summary 
measures 


Now we shall derive the asymptotic distributions of quantile based summary measures like the sample quartile deviation and the sample median. For a sample of size n, the sample quartile deviation is defined as $Q_n = (X_{(s)} - X_{(r)})/2$, where $r = [n/4 + 1]$ and $s = [3n/4 + 1]$. If $Q$ denotes the population QD, then it follows from the joint asymptotic distribution of sample order statistics that $\sqrt{n}(Q_n - Q) \xrightarrow{D} N(0, \tau^2)$, where $\tau^2 = \frac{1}{64}\left(\frac{3}{f^2(\xi_1)} + \frac{3}{f^2(\xi_3)} - \frac{2}{f(\xi_1)f(\xi_3)}\right)$, with $\xi_1$ and $\xi_3$ denoting the first and third population quartiles. However, for a symmetric distribution, $f(\xi_1) = f(\xi_3)$, and hence $\tau^2$ reduces to $\{16f^2(\xi_1)\}^{-1}$. Again for a sample of size n, the sample median is defined as $M_n = X_{(r)}$, where $r = [n/2 + 1]$. If $M$ denotes the population median, then it follows from the asymptotic distribution of sample order statistics that $\sqrt{n}(M_n - M) \xrightarrow{D} N(0, \tau^2)$, where $\tau^2 = \frac{1}{4f^2(M)}$.
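The variance of the quartile deviation can be traced back to the joint result for order statistics. The following worked steps are a sketch of that calculation, using $p_1 = 1/4$, $p_2 = 3/4$ and writing $\sigma_{rr}, \sigma_{ss}, \sigma_{rs}$ for the elements of the asymptotic dispersion matrix of the two sample quartiles.

```latex
\begin{align*}
\sigma_{rr} &= \frac{\tfrac14\cdot\tfrac34}{f^2(\xi_1)} = \frac{3}{16 f^2(\xi_1)}, \qquad
\sigma_{ss} = \frac{\tfrac34\cdot\tfrac14}{f^2(\xi_3)} = \frac{3}{16 f^2(\xi_3)}, \qquad
\sigma_{rs} = \frac{\tfrac14\cdot\tfrac14}{f(\xi_1) f(\xi_3)} = \frac{1}{16 f(\xi_1) f(\xi_3)}, \\
\tau^2 &= \frac14\left(\sigma_{rr} + \sigma_{ss} - 2\sigma_{rs}\right)
        = \frac{1}{64}\left(\frac{3}{f^2(\xi_1)} + \frac{3}{f^2(\xi_3)} - \frac{2}{f(\xi_1) f(\xi_3)}\right).
\end{align*}
```

With $f(\xi_1) = f(\xi_3)$ the bracket equals $4/f^2(\xi_1)$, which gives $\tau^2 = 1/\{16 f^2(\xi_1)\}$ as stated for symmetric distributions.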


3 Asymptotic distribution of extreme order statistics 


We have assumed so far that $\sqrt{n}(r/n - p) \to 0$ for $p \in (0,1)$. But for extreme order statistics (e.g. $r = 1, n-1, n$), $r/n \to 0$ or 1. Thus we need to develop the corresponding asymptotic distribution separately.

For a general development, first assume that $k/n \to 0$ with $k$ fixed. It is already proved in Lemma 1 that for fixed $k$, $U_{(k)} \stackrel{D}{=} \frac{T_k}{T_{n+1}}$, where $T_k = \sum_{i=1}^{k} Y_i$ for iid exponential(mean=1) random variables $Y_i$. Since the $Y_i$ are iid exponential(mean=1) random variables, we have by the WLLN, $\frac{T_{n+1}}{n} \xrightarrow{P} 1$. Hence from Slutsky's theorem $nU_{(k)} \stackrel{D}{=} \frac{T_k}{T_{n+1}/n} \xrightarrow{D} T_k$. Since $T_k \sim Gamma(rate = 1, shape = k)$, we have $nU_{(k)} \xrightarrow{D} Gamma(rate = 1, shape = k)$.

Next assume that $k/n \to 1$, that is $k = n, n-1$ for example. In particular consider $k = n$. Then from Lemma 1, $U_{(n)} \stackrel{D}{=} \frac{T_n}{T_{n+1}}$, where $T_n = \sum_{i=1}^{n} Y_i$ for iid exponential(mean=1) random variables $Y_i$. Hence $n(1 - U_{(n)}) \stackrel{D}{=} n\left(1 - \frac{T_n}{T_{n+1}}\right) = \frac{Y_{n+1}}{T_{n+1}/n}$. As earlier, it follows from the WLLN that $\frac{T_{n+1}}{n} \xrightarrow{P} 1$. Hence from Slutsky's theorem $n(1 - U_{(n)}) \xrightarrow{D} Gamma(rate = 1, shape = 1)$.

In a similar way, for $k = n-1$, $n(1 - U_{(n-1)}) \stackrel{D}{=} n\left(1 - \frac{T_{n-1}}{T_{n+1}}\right) = \frac{Y_n + Y_{n+1}}{T_{n+1}/n} \xrightarrow{D} Gamma(rate = 1, shape = 2)$.
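These limits can be examined with a small simulation. The sketch below (function name, sample size, seed and number of replications are illustrative assumptions) computes the scaled extremes of uniform samples and compares their averages with the means 1, 1 and 2 of the limiting gamma distributions.

```python
import random


def scaled_extremes(n, rng):
    """Return n*U_(1), n*(1 - U_(n)) and n*(1 - U_(n-1)) for one R(0,1) sample of size n."""
    us = sorted(rng.random() for _ in range(n))
    return n * us[0], n * (1.0 - us[-1]), n * (1.0 - us[-2])


rng = random.Random(11)
n, reps = 200, 5000
totals = [0.0, 0.0, 0.0]
for _ in range(reps):
    for j, value in enumerate(scaled_extremes(n, rng)):
        totals[j] += value
averages = [t / reps for t in totals]
# Limiting means: Gamma(shape 1) -> 1, Gamma(shape 1) -> 1, Gamma(shape 2) -> 2.
print([round(a, 2) for a in averages])
```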


4 Asymptotic distribution of sample range & midrange 
for uniform parent 


We start with the joint asymptotic distribution of extreme order statistics for uniform samples. Specifically, if $U_i, i = 1, 2, .., n$ are iid R(0,1) variables, we require the joint asymptotic distribution of $U_{(1)}$ and $U_{(n)}$. Take $a_n > 0$ and $b_n > 0$ and consider $P(a_nU_{(1)} > x \cap b_n(1 - U_{(n)}) > y)$.

Since $P(a_nU_{(1)} > x \cap b_n(1 - U_{(n)}) > y) = P\left(\frac{x}{a_n} < U_i < 1 - \frac{y}{b_n},\ i = 1, 2, .., n\right)$, we get the final expression as $\left(1 - \frac{x}{a_n} - \frac{y}{b_n}\right)^n$ provided $0 < \frac{x}{a_n} < 1 - \frac{y}{b_n} < 1$. Thus it seems sensible to take $a_n = b_n = n$ and as $n \to \infty$, the above inequality is satisfied for positive $x$ and $y$.

Thus for positive $x$ and $y$, $\lim_{n\to\infty}\left(1 - \frac{x}{n} - \frac{y}{n}\right)^n = e^{-x-y}$. Hence

$$\begin{pmatrix} nU_{(1)} \\ n(1 - U_{(n)}) \end{pmatrix} \xrightarrow{D} \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix},$$

where $Y_i, i = 1, 2$ are iid exponential with mean unity.

Next we proceed to the asymptotic distribution of the sample range and midrange. Let $U_i, i = 1, 2, .., n$ be iid R(0,1) variables; we want to obtain the asymptotic distribution of the sample range $R_n = U_{(n)} - U_{(1)}$ and the mid range $M_n = (U_{(n)} + U_{(1)})/2$. We have already derived that for a uniform sample, $(nU_{(1)}, n(1 - U_{(n)}))^T \xrightarrow{D} (Y_1, Y_2)^T$, where $Y_i, i = 1, 2$ are iid exponential with mean unity. Hence applying the continuous mapping theorem with $g(x, y) = x + y$, we observe $nU_{(1)} + n(1 - U_{(n)}) = n(1 - R_n) \xrightarrow{D} Y_1 + Y_2 \sim Gamma(rate = 1, shape = 2)$. In a similar way $n(M_n - \frac{1}{2}) = \frac{1}{2}(nU_{(1)} - n(1 - U_{(n)})) \xrightarrow{D} \frac{1}{2}(Y_1 - Y_2)$, where $Y_1 - Y_2 \sim DE(0, 1)$, the standard double exponential distribution.


Large sample Inference: Module 18


What we provide in this module 


• Applying asymptotic results in inference
• Variance stabilizing transformations
• Applications




1 Large sample inference 


We have discussed so far, asymptotic distributions of different summary measures. These 
results form the basis of developing different inferential procedures concerning different pop- 
ulation characteristics. Starting from a general framework, we discuss both single sample 


and two sample procedures and assess their performance. 


1.1 Estimation in large samples 


Suppose we have a large number of observations from a population indexed by the unknown parameter $\theta$ (which may be vector valued). Then one may be naturally interested in providing a good (say, MVUE) estimate of $\theta$. However, a good estimator is not easy to obtain, particularly when we don't have much idea about the parent population. However, if we confine ourselves to the class of estimators which are very close to the actual parameter for sufficiently large n, we can use results of large sample procedures. For example, suppose we find a sequence of estimators $T_n, n \ge 1$ such that $\sqrt{n}(T_n - \theta) \xrightarrow{D} N(0, \sigma^2)$; then $T_n \xrightarrow{P} \theta$ and hence $T_n$ is a reasonable estimator of $\theta$. Naturally, $T_n$ may perform poorly in small samples.


1.2 Large sample tests & confidence interval estimation 


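Next suppose we are interested in testing $H_0: \theta = \theta_0$ against $H_1: \theta \ne \theta_0$ and only a large number of observations from a population indexed by the single unknown parameter $\theta$ are available. Naturally, the usual methods like the Neyman-Pearson lemma or likelihood ratio methods can't be adopted as the underlying population is not specified. Suppose we find a sequence of statistics $T_n, n \ge 1$ satisfying $\sqrt{n}(T_n - \theta) \xrightarrow{D} N(0, \sigma^2(\theta))$. Then a large sample procedure rejects the null hypothesis at an approximate level $\alpha$ if $\left|\frac{\sqrt{n}(T_n - \theta_0)}{\sigma(\theta_0)}\right| > \tau_{\alpha/2}$, where $\tau_{\alpha/2}$ denotes the upper $\alpha/2$ point of the N(0,1) distribution. An associated confidence interval with approximate confidence coefficient $1 - \alpha$ is easy to obtain from the set $\left\{\theta : \left|\frac{\sqrt{n}(T_n - \theta)}{\sigma(\theta)}\right| \le \tau_{\alpha/2}\right\}$.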


1.3 Problems regarding mean for single univariate population


Suppose n observations are available from a population with mean $\mu$ and finite variance $\sigma^2$. If we are interested in estimating $\mu$ then a consistent estimator is simply $\bar{X}$. Suppose we are interested in $H_0: \mu = \mu_0$. For known $\sigma$, one can use $\sqrt{n}(\bar{X} - \mu_0)/\sigma$ and for unknown $\sigma$, we use $\sqrt{n}(\bar{X} - \mu_0)/s$. For large n, each variable has approximately a N(0,1) distribution under the null hypothesis. Hence a large sample test can be constructed. Next assume $\sigma$ unknown; then to get a confidence interval for $\mu$ we start from the relation $P_\mu(|\sqrt{n}(\bar{X} - \mu)/s| \le \tau_{\alpha/2}) \approx 1 - \alpha$ and arrive at the interval $[\bar{X} - \tau_{\alpha/2}\,s/\sqrt{n},\ \bar{X} + \tau_{\alpha/2}\,s/\sqrt{n}]$ with approximate coverage probability $1 - \alpha$.
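The interval for the mean can be computed directly. The following is a minimal sketch (the function name is ours; `s` is taken with divisor n, which is an assumption matching the definition of the sample variance used in the next subsection).

```python
import math
from statistics import NormalDist, fmean, pstdev


def large_sample_mean_ci(data, alpha=0.05):
    """Approximate (1 - alpha) confidence interval for the mean: xbar -/+ tau * s / sqrt(n)."""
    n = len(data)
    xbar = fmean(data)
    s = pstdev(data)  # divisor n, i.e. s^2 = (1/n) * sum (X_i - Xbar)^2
    tau = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # upper alpha/2 point of N(0, 1)
    half_width = tau * s / math.sqrt(n)
    return xbar - half_width, xbar + half_width


# A tiny data set is used only to show the call; the method is meant for large n.
print(large_sample_mean_ci([1, 2, 3, 4, 5]))
```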


1.4 Problems regarding variance for single univariate population 


Suppose we are interested in estimating $\sigma^2$ or $\sigma$. Then consistent estimators are simply the sample variance $s^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2$ and $s$. Suppose we are interested in $H_0: \sigma = \sigma_0$. Assume that $\mu$ is unknown and the parent distribution is normal. Then we use either $\sqrt{2n}(s - \sigma_0)/\sigma_0$ or $\sqrt{n/2}\,(s^2 - \sigma_0^2)/\sigma_0^2$, which are asymptotically N(0,1) under the null hypothesis. Hence large sample tests can be constructed. However, if the parent is not normal, then $\sqrt{n}(m_2 - \mu_2) \xrightarrow{D} N(0, \tau^2)$ with $\tau^2 = \mu_4 - \mu_2^2$. Since $m_4$ is a consistent estimator of $\mu_4$, we can use the statistic $\frac{\sqrt{n}(s^2 - \sigma_0^2)}{\hat{\tau}}$, where $\hat{\tau}$ is a consistent estimator of $\tau$. Naturally, such a statistic is asymptotically N(0,1), and hence tests can be constructed as earlier. Using either statistic, one can construct a confidence interval with approximate coverage probability $1 - \alpha$.


1.5 Problems regarding mean for two univariate populations 


Suppose $n_i$ observations are available from a population with mean $\mu_i$ and finite variance $\sigma_i^2$, $i = 1, 2$. Assume that the two populations are independent. For $\mu = \mu_1 - \mu_2$, a consistent estimator is simply $\bar{X}_1 - \bar{X}_2$. Next suppose we are interested in $H_0: \mu_1 - \mu_2 = \mu_0$. If the $\sigma_i$'s are known, one can use $T_n = \frac{\bar{X}_1 - \bar{X}_2 - \mu_0}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$. If we assume that $n_1/(n_1 + n_2) \to \lambda \in (0,1)$ and $n_2/(n_1 + n_2) \to 1 - \lambda$, then it follows that $\sqrt{n_1 + n_2}\,(\bar{X}_1 - \mu_1, \bar{X}_2 - \mu_2)^T \xrightarrow{D} N_2(0, Diag(\sigma_1^2/\lambda, \sigma_2^2/(1 - \lambda)))$. Thus under the null hypothesis, $T_n$ is asymptotically N(0,1) and hence large sample tests can be constructed.


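1.6 Problems regarding mean for two univariate populations (unknown variances)


However, if the $\sigma_i$'s are unknown, we replace them by their consistent estimates $s_i^2, i = 1, 2$. Then applying Slutsky's theorem, $T_n' = \frac{\bar{X}_1 - \bar{X}_2 - \mu_0}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$ is asymptotically normal and hence tests can be constructed. In addition, considering $\frac{\bar{X}_1 - \bar{X}_2 - \mu}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$ as a pivot for $\mu$, one can find an approximate confidence interval $\left[(\bar{X}_1 - \bar{X}_2) - \tau_{\alpha/2}\sqrt{s_1^2/n_1 + s_2^2/n_2},\ (\bar{X}_1 - \bar{X}_2) + \tau_{\alpha/2}\sqrt{s_1^2/n_1 + s_2^2/n_2}\right]$ with approximate coverage probability $1 - \alpha$.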
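The two sample statistic with estimated variances can be sketched as follows (the function name is ours; the sample variances are taken with divisor $n_i$, an assumption in line with the definition of $s^2$ used in these notes).

```python
import math
from statistics import fmean, pvariance


def two_sample_mean_statistic(x, y, mu0=0.0):
    """Large sample statistic (xbar - ybar - mu0) / sqrt(s1^2/n1 + s2^2/n2)."""
    se = math.sqrt(pvariance(x) / len(x) + pvariance(y) / len(y))
    return (fmean(x) - fmean(y) - mu0) / se


# Tiny samples are used only to show the call; the N(0, 1) approximation needs large n1, n2.
stat = two_sample_mean_statistic([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(stat)
```

The null hypothesis is rejected at approximate level $\alpha$ against a two sided alternative when the absolute value of the statistic exceeds $\tau_{\alpha/2}$.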


1.7 Problems regarding variance for two univariate populations 


Suppose $n_i$ observations are available from a population with unknown mean $\mu_i$ and finite variance $\sigma_i^2$, $i = 1, 2$. Assume that the two populations are independent. For $\sigma_1/\sigma_2$, a consistent estimator is simply $s_1/s_2$. Next suppose we are interested in $H_0: \sigma_1^2 - \sigma_2^2 = 0$.

For simplicity, we assume normal parents. Then considering the fact that $\sqrt{n_i}(s_i^2 - \sigma_i^2) \xrightarrow{D} N(0, 2\sigma_i^4)$ independently for each $i$, we consider $T_n = \frac{s_1^2 - s_2^2 - 0}{\sqrt{2\sigma^4/n_1 + 2\sigma^4/n_2}}$, where $\sigma^2$ is the common unspecified value under the null hypothesis. Thus we replace $\sigma^2$ by its pooled estimator $s^2 = (n_1s_1^2 + n_2s_2^2)/(n_1 + n_2)$. Since, under $n_1/(n_1 + n_2) \to \lambda \in (0,1)$ and $n_2/(n_1 + n_2) \to 1 - \lambda$, $s^2$ is consistent for $\sigma^2$, it follows that $T_n = \frac{s_1^2 - s_2^2}{\sqrt{2s^4/n_1 + 2s^4/n_2}} \xrightarrow{D} N(0, 1)$ under the null hypothesis and hence large sample tests can be constructed.

Again considering the fact that $\sqrt{n_i}(s_i - \sigma_i) \xrightarrow{D} N(0, \sigma_i^2/2)$ independently for each $i$, we can alternatively consider $T_n' = \frac{s_1 - s_2 - 0}{\sqrt{\sigma^2/2n_1 + \sigma^2/2n_2}}$, where $\sigma^2$ is the common unspecified value under the null hypothesis. As earlier, we can replace $\sigma^2$ by its pooled estimator $s^2$. Then as before, $\frac{s_1 - s_2}{\sqrt{s^2/2n_1 + s^2/2n_2}} \xrightarrow{D} N(0, 1)$ under the null hypothesis and hence large sample tests can be constructed. However, if normality is not assumed, then we should use the fact that $T_n'' = \frac{s_1^2 - s_2^2}{\widehat{SE}(s_1^2 - s_2^2)} \xrightarrow{D} N(0, 1)$ under the null hypothesis, where moment based estimators are used to estimate the standard error. Further considering $\frac{s_1^2 - s_2^2 - (\sigma_1^2 - \sigma_2^2)}{\sqrt{2s_1^4/n_1 + 2s_2^4/n_2}}$ as a pivot for $\sigma_1^2 - \sigma_2^2$, one can find an approximate confidence interval. In a similar way considering $\frac{s_1 - s_2 - (\sigma_1 - \sigma_2)}{\sqrt{s_1^2/2n_1 + s_2^2/2n_2}}$ as a pivot for $\sigma_1 - \sigma_2$, the associated confidence interval can be obtained.


1.8 Problems regarding coefficient of variation 


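Suppose n iid observations from a normal population with unknown parameters are available. Denote the population cv and sample cv, expressed as percentages, by $V = 100\sigma/\mu$ and $v = 100s/\bar{X}$, respectively. Naturally a consistent estimator of $V$ is $v$. Suppose we want to perform a test for $H_0: V = V_0$. Then considering the fact that $\sqrt{n}(v - V) \xrightarrow{D} N\left(0, \frac{V^2}{2}(1 + 10^{-4}\cdot 2V^2)\right)$, we use the statistic $T_n = \frac{\sqrt{2n}\,(v - V_0)}{V_0\sqrt{1 + 10^{-4}\cdot 2V_0^2}}$, which under the null hypothesis is asymptotically distributed as N(0,1). Thus we reject the null hypothesis if $T_n > \tau_\alpha$ or $T_n < -\tau_\alpha$ or $|T_n| > \tau_{\alpha/2}$ according as the alternative is $H: V > V_0$ or $H: V < V_0$ or $H: V \ne V_0$. For an approximate confidence interval for $V$, we start from the relation $P\left(\left|\frac{\sqrt{2n}\,(v - V)}{v\sqrt{1 + 10^{-4}\cdot 2v^2}}\right| \le \tau_{\alpha/2}\right) \approx 1 - \alpha$ to get the confidence interval $\left[v - \frac{\tau_{\alpha/2}}{\sqrt{2n}}\,v\sqrt{1 + 10^{-4}\cdot 2v^2},\ v + \frac{\tau_{\alpha/2}}{\sqrt{2n}}\,v\sqrt{1 + 10^{-4}\cdot 2v^2}\right]$.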


1.9 Large sample tests for normality 


Suppose n iid observations are available from a population with finite moments up to order eight. Often we are interested in testing the null hypothesis that the underlying distribution is normal. Suppose $\gamma_1$ and $\gamma_2$ are, respectively, the coefficients of skewness and kurtosis, with sample counterparts $g_1$ and $g_2$. Then it is well known that for a normal parent $\gamma_1 = 0$ and $\gamma_2 = 0$. Thus, we can perform tests of normality via $H_{01}: \gamma_1 = 0$ and $H_{02}: \gamma_2 = 0$. For testing $H_{01}$, we use the statistic $\sqrt{n/6}\,g_1$, which is asymptotically N(0,1) under the null hypothesis. However, for testing $H_{02}$, we use the statistic $\sqrt{n/24}\,g_2$, which is asymptotically N(0,1) under the null hypothesis. For the first test n = 30 is sufficient to reach normality whereas for the second test n = 200 ensures normality. But these are tests against skewed and symmetric alternatives, respectively, and hence one can consider the combined hypothesis $H_{03}: \gamma_1 = 0, \gamma_2 = 0$. Then an obvious statistic is $n(g_1^2/6) + n(g_2^2/24)$, which under the null hypothesis of normality is distributed asymptotically as $\chi^2_2$.
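The combined statistic can be computed from the sample moments. The sketch below assumes the usual moment based definitions $g_1 = m_3/m_2^{3/2}$ and $g_2 = m_4/m_2^2 - 3$ (these definitions and the function name are our assumptions, since the notes do not spell them out here).

```python
def skewness_kurtosis_statistic(data):
    """Return (g1, g2, n*g1^2/6 + n*g2^2/24) based on central sample moments."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    g1 = m3 / m2**1.5        # sample coefficient of skewness
    g2 = m4 / m2**2 - 3.0    # sample coefficient of (excess) kurtosis
    return g1, g2, n * (g1**2 / 6.0 + g2**2 / 24.0)


# A tiny symmetric data set, used only to show the call (the test itself needs large n).
g1, g2, combined = skewness_kurtosis_statistic([-2, -1, 0, 1, 2])
print(g1, g2, combined)
```

The combined value is compared with the upper $\alpha$ point of the chi-square distribution with 2 degrees of freedom.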


2 Variance stabilization 


2.1 Why stabilize ? 


Suppose iid observations are available from k populations indexed by parameters $\theta_i, i = 1, 2, .., k$. Suppose a consistent estimator $T_{in}, i = 1, 2, .., k$ is available. Then often the asymptotic variance $\sigma^2(\theta_i)$ depends on $\theta_i$ and we need to replace it by a consistent estimate for making a test. Such substitution makes the convergence slower. Thus, it would be better if we can use some transformation to make the asymptotic variance independent of the parameter.

Thus we need some transformation $h(T)$ for some continuous function $h$ with a non zero derivative at $\theta$, such that the asymptotic variance $(h'(\theta))^2\sigma^2(\theta)$ is independent of $\theta$. Hence we need to decide $h$ such that $(h'(\theta))^2\sigma^2(\theta) = c^2$ is satisfied for some $c$ independent of $\theta$. Thus $h$ must satisfy the differential equation $\frac{dh(\theta)}{d\theta} = \frac{c}{\sigma(\theta)}$, so that $h$ can be determined up to a change of origin and scale.


2.2 Examples 


Binomial proportion: Suppose $X_i, i = 1, 2, ..$ are iid Bin(1, p) variables. Then $\sqrt{n}(\bar{X} - p) \xrightarrow{D} N(0, p(1 - p))$. We are looking for some transformation $h(\bar{X})$ making the asymptotic variance constant. Then $h(p) = c\int \frac{dp}{\sqrt{p(1 - p)}} + k$, which on integration gives $h(p) = 2c\sin^{-1}(\sqrt{p}) + k$. Choosing $c = .5$ and $k = 0$, we get the transformation $h(p) = \sin^{-1}(\sqrt{p})$.

Poisson mean: Suppose $X_i, i = 1, 2, ..$ are iid Poisson($\theta$) variables. Then $\sqrt{n}(\bar{X} - \theta) \xrightarrow{D} N(0, \theta)$. We are looking for some transformation $h(\bar{X})$ making the asymptotic variance constant. Then $h(\theta) = c\int \frac{d\theta}{\sqrt{\theta}} + k$, which on integration gives $h(\theta) = 2c\sqrt{\theta} + k$. Choosing $c = .5$ and $k = 0$, we get the transformation $h(\theta) = \sqrt{\theta}$.

Normal variance: Suppose $X_i, i = 1, 2, ..$ are iid $N(\mu, \sigma^2)$. Then $\sqrt{n}(s^2 - \sigma^2) \xrightarrow{D} N(0, 2\sigma^4)$. As earlier, we need some transformation $h(s^2)$ making the asymptotic variance independent of $\sigma^2$. Then $h(\sigma^2) = c\int \frac{d\sigma^2}{\sqrt{2}\,\sigma^2} + k = \frac{c}{\sqrt{2}}\log\sigma^2 + k$. Choosing $c = \sqrt{2}$ and $k = 0$, we get $h(\sigma^2) = \log\sigma^2$.

Correlation coefficient: Suppose $(X_i, Y_i), i = 1, 2, ..$ are iid observations from a bivariate normal distribution with correlation coefficient $\rho$. Then for the sample correlation coefficient $r$, it is already deduced that $\sqrt{n}(r - \rho) \xrightarrow{D} N(0, (1 - \rho^2)^2)$. As earlier, we need some function $h$ making the asymptotic variance independent of the parameter. Then as earlier, $h(\rho) = c\int \frac{d\rho}{1 - \rho^2} + k$, which on integration gives $h(\rho) = \frac{c}{2}\log\left(\frac{1 + \rho}{1 - \rho}\right) + k$. Choosing $c = 1$ and $k = 0$, we get the desired transformation $h(\rho) = \frac{1}{2}\log\left(\frac{1 + \rho}{1 - \rho}\right) = \tanh^{-1}\rho$. This particular transformation is known as Fisher's Z transformation.
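The defining property of these four transformations, that $(h'(\theta))^2\sigma^2(\theta)$ does not depend on $\theta$, can be checked numerically. The sketch below uses a central difference for $h'$; the helper name, the step size and the grid of parameter values are our choices for illustration.

```python
import math


def stabilized_variance(h, sigma2, theta, eps=1e-6):
    """Numerically evaluate (h'(theta))^2 * sigma^2(theta) using a central difference."""
    derivative = (h(theta + eps) - h(theta - eps)) / (2.0 * eps)
    return derivative**2 * sigma2(theta)


cases = {
    "binomial proportion": (lambda p: math.asin(math.sqrt(p)), lambda p: p * (1.0 - p), (0.2, 0.5, 0.8)),
    "Poisson mean": (math.sqrt, lambda t: t, (0.5, 2.0, 10.0)),
    "normal variance": (math.log, lambda v: 2.0 * v**2, (0.5, 1.0, 4.0)),
    "correlation": (math.atanh, lambda r: (1.0 - r**2) ** 2, (-0.6, 0.0, 0.7)),
}

for name, (h, sigma2, thetas) in cases.items():
    # Each row should show (nearly) the same value at every parameter point.
    print(name, [round(stabilized_variance(h, sigma2, t), 4) for t in thetas])
```

The constant values are $c^2 = .25$, $.25$, $2$ and $1$ for the four examples, in agreement with the choices of $c$ made in the text.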


2.3 Convergence rate of the variance stabilized transformation 


The greatest advantage of variance stabilizing transformations is that normality is reached more rapidly than for the original statistic. As an example, consider Fisher's Z transformation. It is already noted that $\sqrt{n}(r - \rho) \xrightarrow{D} N(0, (1 - \rho^2)^2)$. Then from Slutsky's theorem, it follows that $T_{1n} = \frac{\sqrt{n}(r - \rho)}{1 - r^2} \xrightarrow{D} N(0, 1)$. Again for Fisher's Z transformation $T_{2n} = \sqrt{n}(\tanh^{-1}r - \tanh^{-1}\rho) \xrightarrow{D} N(0, 1)$. Thus for comparison, we conduct a simulation study with a bivariate normal distribution having correlation coefficient $\rho$ and provide a histogram together with the normal density plot for both the variables for different choices of n and $\rho$. All these can be found in the next pages.

It is observed that for both the variables convergence to normality is reached for $\rho$ near 0. The transformed variable converges at a faster rate. However, for values of $\rho$ far from 0, the convergence to normality is not seen even for n = 150 for the original variable and for n = 100 for the transformed variable. In particular for $\rho$ near $\pm 1$, the distribution is quite different from the standard normal even for large n.


[Figure 1: Histograms (density against x) with the normal density, showing the convergence rate of $T_1$ for $\rho = -.6, .3, .7$ with n = 150 (clockwise)]

[Figure 2: Histograms (density against x) with the normal density, showing the convergence rate of $T_2$ for $\rho = -.6, .3, .7$ with n = 100 (left to right)]


Large sample Inference: Module 19


What we provide in this module 


• Applications of variance stabilization for k populations
• Pearson chi-square tests
• Applications




1 Use of Z transformation for bivariate populations 


1.1 Inferential procedures for a single population 


Suppose $(X_i, Y_i), i = 1, 2, ..$ are iid observations from a bivariate normal distribution with correlation coefficient $\rho$. Suppose we want to test $H_0: \rho = \rho_0$ for $\rho_0 \ne 0$. Then we can use the statistic $T_n = \sqrt{n}(\tanh^{-1}r - \tanh^{-1}\rho_0)$, which under the null hypothesis is asymptotically N(0,1). Thus we reject $H_0$ against $H_1: \rho \ne \rho_0$ if $|T_n| > \tau_{\alpha/2}$. Similarly the one sided hypotheses can also be considered. Again considering $\sqrt{n}(\tanh^{-1}r - \tanh^{-1}\rho)$ as a pivot, one can obtain a confidence interval $[\tanh(z_1), \tanh(z_2)]$ with approximate confidence coefficient $1 - \alpha$, where $z_1 = \tanh^{-1}(r) - \tau_{\alpha/2}/\sqrt{n}$ and $z_2 = \tanh^{-1}(r) + \tau_{\alpha/2}/\sqrt{n}$.
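The interval based on the Z transformation is straightforward to compute. A minimal sketch, with a function name of our own choosing, follows.

```python
import math
from statistics import NormalDist


def fisher_z_interval(r, n, alpha=0.05):
    """Approximate (1 - alpha) interval [tanh(z1), tanh(z2)] for the correlation coefficient."""
    tau = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # upper alpha/2 point of N(0, 1)
    z = math.atanh(r)
    z1 = z - tau / math.sqrt(n)
    z2 = z + tau / math.sqrt(n)
    return math.tanh(z1), math.tanh(z2)


# Hypothetical inputs: a sample correlation of 0.5 observed on n = 100 pairs.
print(fisher_z_interval(0.5, 100))
```

Since tanh is increasing and maps the real line into (-1, 1), the resulting interval always lies inside the admissible range of the correlation coefficient.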


1.2 Use of Z transformation for k bivariate populations 


Estimation: Suppose we have k independent samples from k bivariate normal populations with common correlation coefficient $\rho$. Assume that $n_i$ observations are available from the $i$th population and $r_i$ is the sample correlation coefficient for the $i$th population. Suppose it is required to obtain a pooled estimate of $\rho$. Define $z_i = \tanh^{-1}r_i$; then asymptotically it has a normal distribution with mean $\zeta = \tanh^{-1}\rho$ and variance $1/n_i$. Hence considering the inverse variances as the weights, we get the pooled estimator of $\zeta$ as $\hat{\zeta} = \frac{\sum_{i=1}^{k} n_iz_i}{\sum_{i=1}^{k} n_i}$. Thus a pooled estimator of $\rho$ can be obtained from the above as $\hat{\rho} = \tanh(\hat{\zeta}) = \frac{e^{2\hat{\zeta}} - 1}{e^{2\hat{\zeta}} + 1}$. However, an improved estimator can be obtained if we replace $n_i$ by $n_i - 3$.


Hypothesis testing: Suppose independent samples of size $n_i$ are available from k
bivariate normal populations with correlation coefficient $\rho_i$ for the i-th population. Suppose
we are interested in $H_0 : \rho_1 = \rho_2 = \cdots = \rho_k$. If $r_i$ is the sample correlation coefficient
for the i-th population and $z_i = \tanh^{-1} r_i$, $\zeta_i = \tanh^{-1} \rho_i$, then $\sqrt{n_i}(z_i - \zeta_i)$ is asymptotically $N(0, 1)$. If $\zeta_0$ is the common
unspecified value under the null hypothesis, then we may take $T_n = \sum_{i=1}^{k} (n_i - 3)(z_i - \zeta_0)^2$,
which is approximately $\chi^2_k$. However, $\zeta_0$ is not specified and hence we replace it by the pooled
estimator

$$\hat{\zeta}_0 = \frac{\sum_{i=1}^{k} (n_i - 3) z_i}{\sum_{i=1}^{k} (n_i - 3)}$$

to get the modified statistic $T_n' = \sum_{i=1}^{k} (n_i - 3)(z_i - \hat{\zeta}_0)^2$, which
is approximately $\chi^2_{k-1}$. Thus a large sample test would reject the null hypothesis if the
observed value of $T_n'$ exceeds $\chi^2_{k-1, \alpha}$.
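As a rough sketch of the procedure just described, the following computes the pooled estimate and the homogeneity statistic from sample correlations and sample sizes. The function name and the example inputs are hypothetical; the weights $n_i - 3$ follow the improved version mentioned in the text, and the statistic is to be compared with a tabulated $\chi^2_{k-1, \alpha}$ value.

```python
import math


def correlation_homogeneity(rs, ns):
    """Return (statistic, degrees of freedom, pooled rho) for H0: equal correlations.

    rs: sample correlation coefficients r_i; ns: sample sizes n_i (each > 3).
    """
    weights = [n - 3 for n in ns]            # inverse asymptotic variances
    zs = [math.atanh(r) for r in rs]         # z_i = atanh(r_i)
    zeta_hat = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    stat = sum(w * (z - zeta_hat) ** 2 for w, z in zip(weights, zs))
    return stat, len(rs) - 1, math.tanh(zeta_hat)


if __name__ == "__main__":
    stat, df, rho_pooled = correlation_homogeneity([0.42, 0.55, 0.61], [40, 55, 80])
    print(f"T'_n = {stat:.3f} on {df} df, pooled rho = {rho_pooled:.3f}")
```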

The same methods can be adopted for testing the homogeneity of k Bernoulli populations, of k normal
populations with respect to variance, or of k Poisson populations. In addition, for k independent
samples of different sizes from the same population, one can, as earlier, obtain pooled
estimates of the parameter of interest. However, the subsequent development is routine
and hence we skip the details.


2 Asymptotic distributions related to quadratic forms


Often our test statistic comes in the form of a quadratic form. If the exact distribution of
such a statistic is not tractable, asymptotic methods are the only option. However, the
asymptotic distribution of a quadratic form is not the usual normal; often it is chi-square.
Thus we start with some basic asymptotic results on quadratic forms and later provide
applications in statistical hypothesis testing.


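2.1 Few results


Lemma 1: If $X^{p \times 1} \sim N_p(\mu, \Sigma)$, with positive definite $\Sigma$, then
$Z = (X - \mu)^T \Sigma^{-1} (X - \mu) \sim \chi^2_p$.


Proof: Since $\Sigma$ is positive definite, there exists a non-singular matrix $C$ such that
$C \Sigma C^T = I_p$. Now, if we define $Y = C(X - \mu)$, then $Y \sim N_p(0, I_p)$. Now
$Z = Y^T Y = \sum_{i=1}^{p} Y_i^2$, where $Y_i$ are iid standard normal variates. Hence by definition, $Z \sim \chi^2_p$.


Lemma 2: If $X_i$, $i \geq 1$ are iid p-component random vectors with mean $\mu$ and dispersion
matrix $\Sigma > 0$, then

$$W_n = n(\bar{X} - \mu)^T S_n^{-1} (\bar{X} - \mu) \xrightarrow{D} \chi^2_p,$$

where $S_n = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})^T$ is the sample variance covariance matrix.


Proof: Using the multivariate CLT, we have $\sqrt{n}(\bar{X} - \mu) \xrightarrow{D} Y$, where $Y \sim N_p(0, \Sigma)$. Since
$S_n = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})^T = \frac{1}{n} \sum_{i=1}^{n} X_i X_i^T - \bar{X}\bar{X}^T$, it follows from the WLLN that
$S_n \xrightarrow{P} (\Sigma + \mu\mu^T) - \mu\mu^T = \Sigma$. Hence from Slutsky's theorem, $W_n \xrightarrow{D} Y^T \Sigma^{-1} Y$. Thus, applying
Lemma 1 above, we get $Y^T \Sigma^{-1} Y \sim \chi^2_p$ and hence the result follows.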


3 Pearsonian $\chi^2$ statistic


3.1 The statistic & the asymptotic distribution


Suppose n independent trials are performed. Each trial has k mutually exclusive and exhaustive
outcomes $A_1, A_2, \ldots, A_k$ with $p_j$ as the probability for the j-th outcome. Naturally
$\sum_{j=1}^{k} p_j = 1$. Let $f_j$ be the number of trials with outcome $A_j$ so that $\sum_{j=1}^{k} f_j = n$. Then Pearson's
$\chi^2$ statistic is defined as

$$T_n = \sum_{j=1}^{k} \frac{(f_j - n p_j)^2}{n p_j}.$$

Now we deduce the asymptotic distribution of such a statistic.


Result: As $n \to \infty$,

$$T_n \xrightarrow{D} \chi^2_{k-1}.$$


Proof: Define the k-component outcome vector for the i-th trial as $X^{(i)}$, where
$P(X_j^{(i)} = 1) = p_j$ and $P(X_j^{(i)} = 0) = 1 - p_j$, so that exactly k-1 components of $X^{(i)}$ are zeros
and one component is one, that is, $\sum_{j=1}^{k} X_j^{(i)} = 1$. Then $X^{(i)}$, $i = 1, 2, \ldots, n$ are iid vectors with
$X^{(i)} \sim \text{Multinomial}(1, p_1, \ldots, p_k)$, so that $E(X^{(i)}) = p = (p_1, p_2, \ldots, p_k)^T$ and
$\text{Disp}(X^{(i)}) = \text{Diag}(p_1, p_2, \ldots, p_k) - pp^T$. However, $\text{Disp}(X^{(i)})$ is singular and hence we introduce
$Y^{(i)} = (X_1^{(i)}, X_2^{(i)}, \ldots, X_{k-1}^{(i)})^T$, $i = 1, 2, \ldots, n$. Then $Y^{(i)}$ are iid vectors with
$E(Y^{(i)}) = (p_1, p_2, \ldots, p_{k-1})^T$ and
$\text{Disp}(Y^{(i)}) = \Sigma = \text{Diag}(p_1, p_2, \ldots, p_{k-1}) - (p_1, p_2, \ldots, p_{k-1})^T (p_1, p_2, \ldots, p_{k-1})$.

Since $\Sigma$ is positive definite, it follows from the multivariate CLT that
$Z_n = \sqrt{n}(\bar{Y} - (p_1, p_2, \ldots, p_{k-1})^T) \xrightarrow{D} N_{k-1}(0, \Sigma)$. Hence, by the argument of Lemma 2
(with the known $\Sigma$ in place of $S_n$), $Z_n^T \Sigma^{-1} Z_n \xrightarrow{D} \chi^2_{k-1}$. Now
$\Sigma^{-1} = \text{Diag}(1/p_1, 1/p_2, \ldots, 1/p_{k-1}) + \frac{1}{p_k} \mathbf{1}\mathbf{1}^T$, where $p_k = 1 - \sum_{j=1}^{k-1} p_j$. Then, noting the fact that
$\bar{Y} = (f_1/n, f_2/n, \ldots, f_{k-1}/n)^T$ and $f_k = n - \sum_{j=1}^{k-1} f_j$, simple algebra with the quadratic form
expresses $Z_n^T \Sigma^{-1} Z_n$ as $T_n$. This completes the result.
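The "simple algebra" in the last step can be checked numerically. The sketch below is only an illustration with made-up frequencies and hypothetical function names: it evaluates Pearson's statistic directly and also as the quadratic form $Z_n^T \Sigma^{-1} Z_n$, using $\Sigma^{-1} = \text{Diag}(1/p_j) + \mathbf{1}\mathbf{1}^T / p_k$ so that no matrix inversion is needed.

```python
import math


def pearson_statistic(freqs, probs):
    """T_n = sum_j (f_j - n p_j)^2 / (n p_j)."""
    n = sum(freqs)
    return sum((f - n * p) ** 2 / (n * p) for f, p in zip(freqs, probs))


def quadratic_form(freqs, probs):
    """Z_n' Sigma^{-1} Z_n built from the first k-1 cells only."""
    n = sum(freqs)
    z = [math.sqrt(n) * (f / n - p) for f, p in zip(freqs[:-1], probs[:-1])]
    p_k = probs[-1]
    diagonal_part = sum(zj ** 2 / pj for zj, pj in zip(z, probs[:-1]))
    rank_one_part = sum(z) ** 2 / p_k      # contribution of (1/p_k) * 1 1'
    return diagonal_part + rank_one_part


if __name__ == "__main__":
    f = [18, 22, 30, 30]                   # illustrative frequencies, n = 100
    p = [0.25, 0.25, 0.25, 0.25]
    print(pearson_statistic(f, p), quadratic_form(f, p))
```

The rank-one term reproduces the contribution of the k-th cell, which is why the k-1 dimensional quadratic form equals the k-term sum.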


3.2 Applications of Pearson's $\chi^2$ statistic
3.2.1 Tests for goodness of fit


Often the underlying distribution of the data is not known and the traditional practice is to assume
normality. But such an assumption makes the inference weaker. Therefore the experimenter
wishes to know whether the underlying distribution is of a given form. That is, the objective
is to know which distribution fits the observed data well. Tests for determining whether a given
distribution is a good fit to the observed data are called goodness-of-fit
tests.

We start with a motivating example. Suppose the number of misprints per page of a book
of 300 pages is represented in the form of a grouped distribution:


No. of misprints/page | 0   | 1  | 2  | 3 | ≥4
Number of pages       | 200 | 84 | 13 | 3 | 0


Here the number of observations is large enough to assume normality. But the variable is discrete
with a small number of categories. It is well known that the number of misprints per page
has, in general, a Poisson distribution. But the mean, i.e. the only parameter, is not specified
in advance. How, then, do we test the hypothesis that the observations come from a
Poisson distribution with unknown parameter?

For better understanding, consider another example. Consider the heights of 100 students
in a certain class in the form of a grouped distribution:


Height (cm) | 154-159 | 159-164 | 164-169 | 169-174 | 174-179
# students  | 4       | 15      | 54      | 22      | 5


Here the number of observations is large enough to assume normality. However, the data are
presented in the form of a frequency distribution. Moreover, the distribution of height is
known to be normal. But even under the assumption of normality, the parameters are
not given. How, then, do we check that the observations come from a normal distribution?

From these examples, we see that we often have data in the form of frequency distributions
either from a known discrete distribution with unspecified parameter(s) or from a known
continuous distribution with unspecified parameter(s). In either situation the objective is to
test whether a given distribution fits the data well.


Now we shall discuss such a procedure based on the Pearson statistic. Consider n observations
on a random variable X. Let the range of X be divided into mutually exclusive
and exhaustive sets $A_i$, $i = 1, 2, \ldots, k$, and let the observations be classified into these classes.
Let $f_i$ be the number of observations classified in $A_i$. Then $\sum_{i=1}^{k} f_i = n$. Define $p_i$
as the probability that an X observation falls in $A_i$. Suppose
we have a hypothesis $H_0 : p_1 = p_1^0, \ldots, p_k = p_k^0$, where $p_i^0$, $i = 1, 2, \ldots, k$ are all specified.
Then, under $H_0$, the joint distribution of $(f_1, f_2, \ldots, f_{k-1})$ is (k-1)-variate multinomial with parameters
$(n, p_1^0, p_2^0, \ldots, p_{k-1}^0)$. Thus under the null hypothesis $f_i$ is expected to be close to $n p_i^0$ for
each i. Then departure from the null hypothesis can be measured by Pearson's chi-square
statistic $T_n = \sum_{i=1}^{k} \frac{(f_i - n p_i^0)^2}{n p_i^0}$. The higher the discrepancy, the larger the value of
$T_n$, and hence a higher value indicates departure from the null hypothesis. Since, under the
null hypothesis, $T_n \xrightarrow{D} \chi^2_{k-1}$, a large sample test rejects the null hypothesis
if $T_n > \chi^2_{k-1, \alpha}$.


Large sample Inference: Module 20


What we provide in this module


• Applications of Pearson $\chi^2$
• Consistency
• Comparing consistent estimators




1 Applications of Pearson's $\chi^2$ statistic


1.1 Tests for goodness of fit


We have already discussed that a goodness of fit hypothesis takes the form
$H_0 : p_1 = p_1^0, \ldots, p_k = p_k^0$, where the $p_i^0$ may be completely specified or unknown. For known $p_i^0$, we have
seen how $U_k = \sum_{i=1}^{k} \frac{(f_i - n p_i^0)^2}{n p_i^0}$ can be used to measure the discrepancy. Since, under the null
hypothesis, $U_k \xrightarrow{D} \chi^2_{k-1}$, a large sample test, which rejects the null hypothesis
if $U_k > \chi^2_{k-1, \alpha}$, was suggested.

Now we shall discuss the corresponding procedure when the $p_i^0$ depend on unknown parameters.
Suppose under the null hypothesis $p_i^0$ depends on the unknown parameters $(\theta_1, \theta_2, \ldots, \theta_r)$.
Thus, in such a case, $p_i^0 = p_i^0(\theta_1, \theta_2, \ldots, \theta_r)$ is a function of the unknown parameters. Thus we
need to estimate the $\theta$'s and we suggest using the method of moments (MM). Let $\hat{\theta}_i$, $i = 1, 2, \ldots, r$
be the MM estimates based on the data. If $\hat{p}_i^0 = p_i^0(\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_r)$, $i = 1, 2, \ldots, k$ are the estimates
of $p_i^0$, $i = 1, 2, \ldots, k$, then Pearson's chi-square statistic takes the form
$V_k = \sum_{i=1}^{k} \frac{(f_i - n \hat{p}_i^0)^2}{n \hat{p}_i^0}$.


It can be shown, as earlier, that under the null hypothesis $V_k \xrightarrow{D} \chi^2_{k-r-1}$ as $n \to \infty$. Since
a higher value of $V_k$ indicates departure from the null hypothesis, a right-tailed test based
on $V_k$ is appropriate. Thus, the large sample test rejects the null hypothesis if $V_k > \chi^2_{k-r-1, \alpha}$
at level of significance $\alpha$.


Often the data are available in the form of a grouped distribution. If $n p_i^0$ (or its estimate) is
less than 5 for some i, the corresponding class is pooled with one or more neighbouring classes
so that the expected frequency for the combined class is at least 5. However, in such a case, the
degrees of freedom become the number of classes after combining, minus 1, minus the number of
parameters estimated.
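The steps above (estimate the parameter, compute estimated expected frequencies, pool sparse classes, adjust the degrees of freedom) can be sketched for a Poisson fit to a grouped count distribution such as a table of misprints per page. This is a minimal sketch under stated assumptions, not the module's own computation: the function name is hypothetical, the open last class is treated as its lower limit when computing the moment estimate of the mean, and pooling is done only from the upper tail. The counts in the usage line are illustrative.

```python
import math


def poisson_goodness_of_fit(counts, min_expected=5.0):
    """counts[j] = number of units with value j; the last cell means 'j or more'.

    Returns (lambda_hat, pooled observed, pooled expected, statistic, df).
    """
    n = sum(counts)
    # Moment estimate of the Poisson mean (open class taken at its lower limit).
    lam = sum(j * f for j, f in enumerate(counts)) / n
    k = len(counts)
    probs = [math.exp(-lam) * lam ** j / math.factorial(j) for j in range(k - 1)]
    probs.append(1.0 - sum(probs))           # probability of the open last class
    observed = list(counts)
    expected = [n * p for p in probs]
    # Pool the last class into its neighbour while its expected frequency is < 5.
    while len(expected) > 2 and expected[-1] < min_expected:
        last_expected = expected.pop()
        last_observed = observed.pop()
        expected[-1] += last_expected
        observed[-1] += last_observed
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(expected) - 1 - 1               # classes after pooling - 1 - one parameter
    return lam, observed, expected, stat, df


if __name__ == "__main__":
    lam, obs, exp, stat, df = poisson_goodness_of_fit([200, 84, 13, 3, 0])
    print(f"lambda_hat = {lam:.4f}, V = {stat:.3f} on {df} df")
    # Compare V with the tabulated upper 5% point of chi-square with df degrees
    # of freedom (3.841 when df = 1).
```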


1.2 Tests for association


Suppose we have two characters A and B, where A has k classes $A_i$, $i = 1, 2, \ldots, k$ and B has
l classes $B_j$, $j = 1, 2, \ldots, l$, and a population is classified according to these characters. Let
$p_{ij}$ be the population proportion for the cell $(A_i, B_j)$; then $\sum_{i=1}^{k} \sum_{j=1}^{l} p_{ij} = 1$. Naturally,
$p_{i0} = \sum_{j=1}^{l} p_{ij}$ is the population proportion for $A_i$ and $p_{0j} = \sum_{i=1}^{k} p_{ij}$ is the population
proportion for $B_j$. Suppose we are interested in $H_0$ : A and B are independent, or equivalently
$H_0 : p_{ij} = p_{i0} p_{0j}$ for all $(i, j)$. Pearson's $\chi^2$ statistic can be used for testing the above hypothesis
of association.

First of all, assume that $(p_{i0}, p_{0j})$ are all specified. Suppose n observations are taken from
the above population and we denote the frequency of the cell $(A_i, B_j)$ by $f_{ij}$. Then the
joint distribution of $f_{ij}$, $i = 1, 2, \ldots, k$, $j = 1, 2, \ldots, l$ is multinomial with parameters n and
$p_{ij}$, $i = 1, 2, \ldots, k$, $j = 1, 2, \ldots, l$. Then, as before, one can use the statistic
$T_n = \sum_{i=1}^{k} \sum_{j=1}^{l} \frac{(f_{ij} - n p_{i0} p_{0j})^2}{n p_{i0} p_{0j}}$
to measure the departure from the null hypothesis. Naturally, the higher the discrepancy,
the larger the value of the statistic. Since, under the null hypothesis,
$T_n \xrightarrow{D} \chi^2_{kl-1}$, a large sample test rejects the null hypothesis if $T_n > \chi^2_{kl-1, \alpha}$,
provided $(p_{i0}, p_{0j})$ are all specified.


Most often the null hypothesis does not specify $(p_{i0}, p_{0j})$, and hence they require estimation
from the data. For the observed data, the maximum likelihood estimators are $\hat{p}_{i0} = f_{i0}/n$ and
$\hat{p}_{0j} = f_{0j}/n$, where $f_{i0} = \sum_{j} f_{ij}$ and $f_{0j} = \sum_{i} f_{ij}$. Plugging in these estimates, we get the statistic
$T_n' = n\left\{\sum_{i=1}^{k} \sum_{j=1}^{l} \frac{f_{ij}^2}{f_{i0} f_{0j}} - 1\right\}$.
Since, under the null hypothesis, $T_n' \xrightarrow{D} \chi^2_{(k-1)(l-1)}$, a large sample test rejects the null hypothesis
if $T_n' > \chi^2_{(k-1)(l-1), \alpha}$.
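A small sketch of the plug-in statistic for a $k \times l$ table of frequencies follows. The function name and the example table are hypothetical, and all marginal totals are assumed to be positive. It uses the closed form $n\{\sum_i \sum_j f_{ij}^2 / (f_{i0} f_{0j}) - 1\}$ rather than the sum of squared deviations; the two are algebraically identical.

```python
def independence_statistic(table):
    """Return (T'_n, df) for a k x l frequency table given as a list of rows."""
    row_totals = [sum(row) for row in table]            # f_{i0}
    col_totals = [sum(col) for col in zip(*table)]      # f_{0j}
    n = sum(row_totals)
    s = sum(
        f * f / (row_totals[i] * col_totals[j])
        for i, row in enumerate(table)
        for j, f in enumerate(row)
    )
    stat = n * (s - 1.0)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df


if __name__ == "__main__":
    stat, df = independence_statistic([[30, 10], [10, 30]])
    print(f"T'_n = {stat:.3f} on {df} df")
```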


1.3 Tests for homogeneity


Suppose we have k populations classified in the same manner according to a character A,
where A has l classes $A_j$, $j = 1, 2, \ldots, l$. Let $p_{ij}$ be the population proportion for the class
$A_j$ in the i-th population. Thus $\sum_{j=1}^{l} p_{ij} = 1$ for all i. We are interested in testing whether the
k populations are similar (i.e. homogeneous). That is, we want to test the null hypothesis
$H_0 : p_{1j} = p_{2j} = \cdots = p_{kj} = p_j$ for all j, and Pearson's $\chi^2$ statistic is also useful in this case.

Assume that the $p_j$'s are all specified. Suppose a sample of size $n_i$ is taken from the i-th population,
$i = 1, 2, \ldots, k$. Define $f_{ij}$ as the observed frequency for class $A_j$ in the i-th population, so that
$\sum_{j=1}^{l} f_{ij} = n_i$. Then the joint distribution of $f_{ij}$, $j = 1, 2, \ldots, l$ is multinomial with parameters
$n_i$ and $p_{ij}$, $j = 1, 2, \ldots, l$, independently for each i. Thus
$\sum_{j=1}^{l} \frac{(f_{ij} - n_i p_{ij})^2}{n_i p_{ij}}$ is approximately a $\chi^2_{l-1}$ variate. Due to independence,
$\sum_{i=1}^{k} \sum_{j=1}^{l} \frac{(f_{ij} - n_i p_{ij})^2}{n_i p_{ij}}$ is approximately a $\chi^2_{k(l-1)}$ variate. Since,
under the null hypothesis, $T_n = \sum_{i=1}^{k} \sum_{j=1}^{l} \frac{(f_{ij} - n_i p_j)^2}{n_i p_j} \xrightarrow{D} \chi^2_{k(l-1)}$ and a higher value is
indicative of departure, a large sample test rejects the null hypothesis if $T_n > \chi^2_{k(l-1), \alpha}$, provided
the $p_j$ are all specified.


However, the $p_j$'s are not always specified and we need to substitute them by appropriate estimates.
One can use the pooled estimate

$$\hat{p}_j = \frac{\text{Total frequency for the j-th class}}{\text{Total frequency}} = \frac{\sum_{i=1}^{k} f_{ij}}{\sum_{i=1}^{k} n_i}$$

for the purpose. Plugging in these estimates, we get the statistic
$T_n' = \sum_{i=1}^{k} \sum_{j=1}^{l} \frac{(f_{ij} - n_i \hat{p}_j)^2}{n_i \hat{p}_j} = n\left\{\sum_{i=1}^{k} \sum_{j=1}^{l} \frac{f_{ij}^2}{n_i f_{0j}} - 1\right\}$,
where $n = \sum_{i=1}^{k} n_i$ and $f_{0j} = \sum_{i=1}^{k} f_{ij}$, which under $H_0$ is asymptotically $\chi^2_{(k-1)(l-1)}$. Then, as earlier, a large
sample test rejects the null hypothesis if $T_n' > \chi^2_{(k-1)(l-1), \alpha}$.


2 Yates correction for continuity


It has already been mentioned that we need the expected frequencies or their estimates to be large
enough for the $\chi^2$ approximation to hold. However, the pooling procedure cannot be adopted for
2X2 tables. Yates suggested a modification, called Yates correction for continuity, for the
Pearson $\chi^2$ statistic to use in such 2X2 tables with low cell frequencies. The idea is that for
an $X \sim \text{Bin}(n, p)$ variable with large n, $\Phi\left(\frac{x + 0.5 - np}{\sqrt{np(1-p)}}\right)$ approximates $P(X \leq x)$ better than the
usual approximation $\Phi\left(\frac{x - np}{\sqrt{np(1-p)}}\right)$. In a similar way, $1 - \Phi\left(\frac{x - 0.5 - np}{\sqrt{np(1-p)}}\right)$ approximates $P(X \geq x)$
better than the usual approximation $1 - \Phi\left(\frac{x - np}{\sqrt{np(1-p)}}\right)$. The idea is to make a smaller (larger)
x slightly larger (smaller).

Consider a 2X2 contingency table with observed frequencies a, b, c and d, arranged as in Table 1,
which are too small to qualify for the $\chi^2$ approximation.


Yates correction suggests adding or subtracting 0.5 to each cell frequency, keeping the marginal
totals unchanged. For ad > bc, we consider the modified frequencies a-0.5, b+0.5, c+0.5 and
d-0.5, whereas for ad < bc, the modified frequencies a+0.5, b-0.5, c-0.5 and d+0.5 are suggested.


Table 1: 2X2 Contingency Table


       | Counts | Counts | Total
Counts | a      | b      | a+b
Counts | c      | d      | c+d
Total  | a+c    | b+d    | n=a+b+c+d


Then the modified statistic $\chi^2$ takes the forms
$\frac{n(ad - bc - n/2)^2}{(a+b)(c+d)(a+c)(b+d)}$ and $\frac{n(ad - bc + n/2)^2}{(a+b)(c+d)(a+c)(b+d)}$
respectively.
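The corrected statistic can be sketched as below. The function name and the example cell counts are hypothetical. For the boundary case ad = bc, which the text does not cover, the sketch applies no correction; that choice is an assumption of this example only.

```python
def chi_square_2x2(a, b, c, d, yates=True):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]].

    With yates=True, ad - bc is shifted by n/2 towards zero, which is the
    same as moving each cell by 0.5 with the marginal totals held fixed.
    """
    n = a + b + c + d
    diff = a * d - b * c
    if yates and diff != 0:                   # ad == bc: left uncorrected (assumption)
        diff = diff - n / 2 if diff > 0 else diff + n / 2
    return n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


if __name__ == "__main__":
    print(chi_square_2x2(7, 3, 2, 8, yates=False))   # uncorrected
    print(chi_square_2x2(7, 3, 2, 8))                # with Yates correction
    # Same value from the modified cells a-0.5, b+0.5, c+0.5, d-0.5 (ad > bc here):
    print(chi_square_2x2(6.5, 3.5, 2.5, 7.5, yates=False))
```

The last line illustrates why the two forms of the statistic arise: with the margins unchanged, replacing the cells by the modified frequencies changes ad - bc by exactly n/2.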


3 Large sample properties of estimators


Unbiasedness and minimum variance are small sample properties of estimators. But often the
estimators take implicit forms or, even if the form is tractable, the exact sampling distribution
is difficult to obtain. Thus we need alternative criteria for evaluating the performance of
estimators. One such property is consistency. The idea is closely related to the convergence
concepts of probability.


3.1 Consistency


Let $X_i$, $i \geq 1$ be observations from a population indexed by a parameter $\theta$. Suppose the interest
is to obtain a good substitute for some real valued function $g(\theta)$. Let $T_n = T_n(X_1, \ldots, X_n)$
be an estimator based on the data. It is desirable to have $|T_n - g(\theta)|$ as close as possible to
zero. But $|T_n - g(\theta)|$ is a random quantity and hence varies across samples. Thus, as an
alternative, we need to keep the probability of a low $|T_n - g(\theta)|$ as high as possible. Therefore,
we need to maintain $P\{|T_n - g(\theta)| < \epsilon\}$ or $P\{\sup_{m \geq n} |T_m - g(\theta)| < \epsilon\}$ as high as possible.
As n increases, more and more information is gathered and hence for a reasonable $T_n$ one
would expect $P\{|T_n - g(\theta)| < \epsilon\} \to 1$ or $P\{\sup_{m \geq n} |T_m - g(\theta)| < \epsilon\} \to 1$ for every $\epsilon > 0$. In
the former case $T_n$ is said to be weakly consistent and in the latter case $T_n$ is called strongly
consistent for $g(\theta)$. However, we confine ourselves to weakly consistent, or simply consistent,
estimators.
3.1.1 Sufficient conditions for consistency


Let $T_n$ be an estimator of $g(\theta)$ and assume that $T_n$ has a finite variance. Then from Chebyshev's
inequality, for any $\epsilon > 0$,

$$P(|T_n - g(\theta)| > \epsilon) \leq \epsilon^{-2} E\{T_n - g(\theta)\}^2.$$

Thus consistency of $T_n$ holds if $MSE(T_n) = E\{T_n - g(\theta)\}^2 \to 0$. Now, we at once have
the relation that

$$MSE(T_n) = Var(T_n) + \{E(T_n) - g(\theta)\}^2.$$

Thus $MSE(T_n) \to 0$ is equivalent to $E(T_n) \to g(\theta)$ and $Var(T_n) \to 0$, and these two conditions
together are sufficient for consistency.


3.1.2 The condition is not necessary


Suppose $X_i$ are iid random variables with mean $\mu$ and finite variance $\sigma^2$. Define $T_n$ such
that $T_n = \bar{X}$ with probability $1 - 1/\sqrt{n}$ and $T_n = n$ with probability $1/\sqrt{n}$. First we shall
show that $T_n \xrightarrow{P} \mu$. Note that for every $\epsilon > 0$,

$$P(|T_n - \mu| < \epsilon) \geq (1 - 1/\sqrt{n}) P(|\bar{X} - \mu| < \epsilon).$$

Since $\bar{X} \xrightarrow{P} \mu$, the probability on the RHS of the above converges to unity. Again $1/\sqrt{n} \to 0$,
and hence for every $\epsilon > 0$, $P(|T_n - \mu| < \epsilon) \to 1$. However, if $F_n$ is the DF of $\bar{X}$, then we
observe that

$$E|T_n| = (1 - 1/\sqrt{n}) \int |x| \, dF_n(x) + n/\sqrt{n}$$

diverges as $n \to \infty$, and hence neither the expected value nor the MSE of $T_n$ converges. Thus, we find
that $T_n$ is consistent but does not satisfy the sufficient condition.
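The behaviour of this counterexample can be tabulated for a few values of n. The sketch below adds assumptions that are not in the text: the observations are taken to be normal (so that $\bar{X} \sim N(\mu, \sigma^2/n)$ exactly), the random choice between $\bar{X}$ and n is taken to be independent of the data, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def consistency_lower_bound(n, eps, sigma=1.0):
    """(1 - 1/sqrt(n)) * P(|Xbar - mu| < eps) for normal observations."""
    p_close = 2 * NormalDist().cdf(eps * math.sqrt(n) / sigma) - 1
    return (1 - 1 / math.sqrt(n)) * p_close


def expected_value_of_t(n, mu=0.0):
    """E(T_n) = (1 - 1/sqrt(n)) * mu + n * (1/sqrt(n))."""
    return (1 - 1 / math.sqrt(n)) * mu + math.sqrt(n)


if __name__ == "__main__":
    for n in (10 ** 2, 10 ** 4, 10 ** 6):
        bound = consistency_lower_bound(n, eps=0.1)
        print(n, round(bound, 4), expected_value_of_t(n))
```

The lower bound on $P(|T_n - \mu| < \epsilon)$ approaches one while $E(T_n)$ grows like $\sqrt{n}$, which is the point of the example.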


3.2 Comparing consistent estimators: Concept of efficiency
3.2.1 MSE criterion


If $T_{1n}$ and $T_{2n}$ are both consistent for $\theta$ with $MSE(T_{kn}) \to 0$ for k = 1, 2, then the estimator
whose MSE converges at a faster rate is preferable. For example, if we can show that $MSE(T_{1n}) <
MSE(T_{2n})$, then $T_{1n}$ converges at a faster rate.

Let us explain the criterion by means of an example. Consider estimation of $\sigma^2$ based on iid
observations from a normal population with mean $\mu$ and variance $\sigma^2$. Suppose we consider
three estimators, $T_{kn} = c_{kn} \sum_{i=1}^{n} (X_i - \bar{X})^2$, k = 1, 2, 3, with $c_{1n} = 1/n$, $c_{2n} = 1/(n+1)$ and $c_{3n} = 1/(n-1)$.
Then $T_{1n}$ is the MLE, $T_{2n}$ is the minimum MSE estimator and $T_{3n}$ is the MVUE. Considering the
fact that $\sum_{i=1}^{n} (X_i - \bar{X})^2/\sigma^2 \sim \chi^2_{n-1}$, it is easy to establish the ordering
$MSE(T_{2n}) < MSE(T_{1n}) < MSE(T_{3n})$. Hence $T_{2n}$ converges at a faster rate. Thus it is
interesting to note that the best estimator for small samples (i.e. the MVUE) performs poorer
in large samples.
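The ordering can be verified exactly. Writing $S = \sum (X_i - \bar{X})^2$ with $S/\sigma^2 \sim \chi^2_{n-1}$, we have $E(cS) = c(n-1)\sigma^2$ and $Var(cS) = 2c^2(n-1)\sigma^4$, so $MSE(cS)/\sigma^4 = 2(n-1)c^2 + \{(n-1)c - 1\}^2$. The sketch below (the function name is hypothetical) evaluates this with exact rational arithmetic.

```python
from fractions import Fraction


def scaled_mse(c, n):
    """MSE(c * S) / sigma^4, where S / sigma^2 ~ chi-square with n - 1 df."""
    m = n - 1
    variance_term = 2 * m * c * c
    bias_term = (m * c - 1) ** 2
    return variance_term + bias_term


if __name__ == "__main__":
    n = 10
    for label, c in (("MLE, c = 1/n", Fraction(1, n)),
                     ("min MSE, c = 1/(n+1)", Fraction(1, n + 1)),
                     ("MVUE, c = 1/(n-1)", Fraction(1, n - 1))):
        print(label, scaled_mse(c, n))
```

The three values simplify to $(2n-1)/n^2$, $2/(n+1)$ and $2/(n-1)$, which gives the stated ordering for every $n \geq 2$.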


3.2.2 Asymptotic variance criterion


However, finding the exact MSE requires heavy calculation and hence we develop a simpler criterion
based on asymptotic normality. A consistent estimator $T_n$ for $\gamma(\theta)$ is said to be consistent
asymptotically normal (CAN) if, in addition to consistency, asymptotic normality, that is,
$\sqrt{n}(T_n - \gamma(\theta)) \xrightarrow{D} N(0, \sigma^2(\theta))$, holds for some $\sigma^2(\theta) > 0$. Naturally, among CAN estimators,
the estimator with the smallest possible $\sigma^2(\theta) > 0$ is the best, that is, best asymptotically
normal (BAN).


Looking at the fact that in a regular estimation case, based on n iid observations, the minimum
attainable variance for an unbiased estimator of $\gamma(\theta)$ is $\frac{\{\gamma'(\theta)\}^2}{n I(\theta)}$, where $I(\theta)$ is the Fisher
information based on a single observation, Fisher argued that the minimum attainable value
of $\sigma^2(\theta)$ is $\frac{\{\gamma'(\theta)\}^2}{I(\theta)}$. Consequently, Fisher defined $T_n$ as asymptotically efficient/optimal if
$\sigma^2(\theta) = \frac{\{\gamma'(\theta)\}^2}{I(\theta)}$ holds for every $\theta$. The following counterexample of Hodges & Lecam reveals
that such an estimator does not, in general, exist.


If $\bar{X}$ is the sample mean based on n iid observations from a normal population with mean
$\theta$ and variance unity, then $\sqrt{n}(\bar{X} - \theta) \xrightarrow{D} N(0, \sigma^2(\theta))$, where $\sigma^2(\theta) = 1$. Again we find that
$I(\theta) = 1$ and hence $\sigma^2(\theta) = 1/I(\theta)$ for all $\theta$. Thus, according to Fisher, $\bar{X}$ is asymptotically
efficient. Now define another estimator $T_n = a\bar{X} I(|\bar{X}| < n^{-1/4}) + \bar{X} I(|\bar{X}| \geq n^{-1/4})$, for some
$a \in (0, 1)$. Then it can be shown that $\sqrt{n}(T_n - \theta) \xrightarrow{D} N(0, \tau^2(\theta))$ with $\tau^2(\theta) = a^2$ or 1 according as
$\theta = 0$ or $\theta \neq 0$. Thus $\tau^2(\theta) \leq \sigma^2(\theta)$, with strict inequality at $\theta = 0$. Hence, asymptotically efficient
estimators do not exist in the sense of Fisher.


3.2.3 Pitman's notion of asymptotic efficiency


Consider two sequences of estimators $T_{1n}$ and $T_{2n}$ for $\gamma(\theta)$ such that

$$\sqrt{n}(T_{1n} - \gamma(\theta)) \xrightarrow{D} N(0, \tau^2), \text{ and}$$

$$\sqrt{n}(T_{2n'} - \gamma(\theta)) \xrightarrow{D} N(0, \tau^2),$$

where $T_{2n'}$ is based on $n' = n'(n)$ observations. Then the asymptotic relative efficiency (ARE) of
$T_{1n}$ with respect to $T_{2n}$ is defined as $ARE(1|2) = \lim_{n \to \infty} \frac{n'(n)}{n}$, provided the limit exists and
is independent of the subsequence $(n'(n))$. Naturally, $T_{1n}$ is better or worse asymptotically
according as $ARE(1|2) > 1$ or $ARE(1|2) < 1$.


However, if

$$\sqrt{n}(T_{kn} - \gamma(\theta)) \xrightarrow{D} N(0, \sigma_k^2)$$

for k = 1, 2, then $ARE(1|2)$ can be expressed as $\sigma_2^2/\sigma_1^2$.


Let us consider an example to make the development clear. Consider comparison of the
sample mean and the sample median as estimators of a location parameter. To be specific, let $X_i$ be
iid observations from a DF $F(x - \theta)$, where $F(x) + F(-x) = 1$ for all x, with finite variance $\sigma^2$. If
$F'(x) = f(x)$ is non-zero and continuous in a neighbourhood of 0, then

$$\sqrt{n}(T_{1n} - \theta) \xrightarrow{D} N(0, \sigma^2), \text{ and}$$

$$\sqrt{n}(T_{2n} - \theta) \xrightarrow{D} N\left(0, \frac{1}{4 f^2(0)}\right),$$

where $T_{1n}$ is the sample mean and $T_{2n}$ is the sample median. Then $ARE(2|1) = 4\sigma^2 f^2(0)$.
For an N(0,1) parent, we find $ARE(2|1) = 2/\pi < 1$, whereas for a DE(0,1) parent, $ARE(2|1) = 2 > 1$.
Thus for a double exponential parent, the sample median is asymptotically more efficient than the
sample mean.
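The two numerical values quoted above follow directly from $ARE(2|1) = 4\sigma^2 f^2(0)$. A short check, assuming the standard forms of the two parents (N(0,1) with $f(0) = 1/\sqrt{2\pi}$ and $\sigma^2 = 1$; DE(0,1) with density $\frac{1}{2}e^{-|x|}$, so $f(0) = 1/2$ and $\sigma^2 = 2$), is given below; the function name is hypothetical.

```python
import math


def are_median_vs_mean(variance, density_at_centre):
    """ARE(2|1) = 4 * sigma^2 * f(0)^2 for a symmetric location family."""
    return 4 * variance * density_at_centre ** 2


if __name__ == "__main__":
    normal = are_median_vs_mean(1.0, 1 / math.sqrt(2 * math.pi))
    double_exponential = are_median_vs_mean(2.0, 0.5)
    print(f"N(0,1):  {normal:.4f}")
    print(f"DE(0,1): {double_exponential:.4f}")
```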




Large-Sample Inference: Module 11


Learn More


1. DasGupta, A. (2008). Asymptotic Theory of Statistics and Probability, Springer.
2. Serfling, R. (1980). Approximation Theorems of Mathematical Statistics, John Wiley, New York.
3. Shao, J. (2003). Mathematical Statistics, 2nd ed., Springer, New York.
4. Van der Vaart, A. W. (2000). Asymptotic Statistics, Cambridge University Press.
5. Lehmann, E. L. (1999). Elements of Large-Sample Theory, Springer, New York.
6. Sen, P. K. and Singer, J. (1993). Large Sample Methods in Statistics, Chapman & Hall, New York.
7. Ferguson, T. S. (1996). A Course in Large-Sample Theory, Chapman & Hall, New York.
8. Lecam, L. and Yang, G. L. (1990). Asymptotics in Statistics, Springer-Verlag, New York.
9. Rohatgi, V. K. and Saleh, A. K. (2002). An Introduction to Probability and Statistics, Second Edition, John Wiley & Sons Inc., New York.
10. Apostol, Tom M. (1974). Mathematical Analysis, Addison-Wesley Publishing Company, Inc.


A Gateway to all Postgraduate Courses An MHRD project under its National Mission on Education through ICT (NME-ICT) 


Subject: Statistics 


Paper: Statistical Inference 
Module: Bayesian Analysis 


Development Team


Principal investigator: Dr. Bhaswati Ganguli, Professor, 
Department of Statistics, University of Calcutta 

Paper co-ordinator: Dr. Dipak K Dey, Associate Dean and BOT 
Distinguished Professor, Department of Statistics, 
University of Connecticut 

Content writer: Dr. Sourish Das, Assistant Professor, Chennai 
Mathematical Institute 


Content reviewer: Department of Statistics, University of Calcutta 


Introduction


Bayesian Analysis 


Bayesian Analysis is a statistical methodology in which Bayes' 
Theorem is used to estimate the probability of a hypothesis as data 
is observed. 


In this introductory module we discuss: 


1. Frequentist and subjective interpretations of probability 
2. Axioms of Probability 
3. Independence and Conditional Probability 


4. Bayes' Theorem and Its Application


Frequentist and subjective interpretations of probability


The frequency interpretation


The probability that some specific outcome is the relative frequency 
with which that outcome would be obtained if the process were 
repeated a large number of times under similar conditions. 


Example 


The probability of obtaining a head in a fair coin toss is 0.5 
because the relative frequency of heads should be approximately 
0.5 if we flip the coin many times. 


When we make statistical inferences from the frequentist 
perspective, we assume that our data is a sample from an entire 
population. 


1. The population is described by the population mean and the 
population variance that are unknown. 


2. The sample is described by the sample mean and the sample 
variance. 


3. The sample mean and variance provide estimates about the 
mean and variance of the entire population. 
4. Importantly, these estimates are known only with some 
uncertainty. 
Our uncertainty about a statistic like the mean is summarized by 
its sampling distribution. 


The sampling distribution 


Sampling Distribution 


The sampling distribution is a probability distribution of all possible 
values of a statistic of interest for samples of size N that could be 
formed for a given population. 


• The observed sample mean is just one realization.


Problem 


Beyond some textbook cases, finding the exact sampling
distribution is a difficult task.


Solution

The frequentist approach to the problem is to approximate the
sampling distribution by a known distribution, such as the Gaussian or t
distribution, under an assumption such as a large sample size N.


Critique 1 


Needless to say, this is a theoretical construct since, with a large 
population, there will be billions of unique samples and it would be 
superior to simply sample the entire population. 


Critique 2 


P-values refer to the proportion of hypothetical draws from the
sampling distribution that are consistent with the null hypothesis.
As p-values are based on the concept of a sampling distribution, do
they make sense if our data contain almost the entire population?


The Classical Interpretation of Probability


1. The classical interpretation is based on the concept of equally 
likely outcomes. 


2. If the outcome of some process must be one of n different
outcomes, and if these outcomes are equally likely to occur,
then the probability of each outcome is 1/n.


1. If we flip a fair coin, the probability of a head would be 1/2
because head and tail are equally likely outcomes.


2. The classical approach offers an appealing summary of 
uncertainty in a one-shot situation. 


Problems with the Classical Interpretation


The drawback of the classical interpretation is that the concept of 
equally likely outcomes is itself probabilistic. 


1. In a sense, this makes the classical definition of probability 
circular. 


2. Furthermore, the concept begins to break down in contexts 
other than gambling when events are not equally likely. 


Frequentist and subjective interpretations of probability 


The classical response is...

1. Laplace's Rule of Insufficient Reason: in the absence of
compelling evidence to the contrary, we should assume that
events are equally likely.

2. This response is actually more useful to Bayesians
when defending their priors.


The Subjective Interpretation of Probability

1. The probability that a person assigns to a possible outcome of
some process represents his or her own judgment of the
likelihood that the outcome will be obtained.

2. In contrast to the classical and frequentist interpretations of
probability, this means that different individuals could have
different probability judgments.


Example 


If we flip a fair coin, the probability of a head could be i because, 
for some reason, we think that God wants it to be a head. 


The Subjective Interpretation of Probability

1. Is subjective probability theory really that ad hoc?

2. Not necessarily. Bayesian methodology elicits priors in a
manner that ensures coherence.


The Subjective Interpretation of Probability

de Finetti, Chapter 1

"This being granted, once an individual has evaluated the
probabilities of certain events, two cases present
themselves: either it is possible to bet with him in such a
way as to be assured of winning, or else this possibility
does not exist. In the first case, one should say that the
evaluation of probabilities given by this individual
contains an incoherence, an intrinsic contradiction; in the
other we say the individual is coherent. It is precisely this
condition of coherence which constitutes the sole
principle from which one can deduce the whole calculus
of probability".


The Subjective Interpretation of Probability

Extensions of de Finetti's axioms form the basis of subjective
expected utility theory. Later chapters of the book introduce the
concept of exchangeability, which is rather important to probability
theory.


Axioms of Probability

A probability distribution on a sample space S is a specification of
numbers $\Pr(A_i)$ which satisfy:

Axiom 1
For any outcome $A_i$, $\Pr(A_i) \ge 0$.

Axiom 2
$\Pr(S) = 1$.

Axiom 3
For a sequence of disjoint events $A_1, A_2, \ldots$,

$$\Pr\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \Pr(A_i)$$

It turns out that each of these three axioms can be justified using
the coherence criterion.


Theorems of Probability

1. $\Pr(\emptyset) = 0$

2. For any finite sequence of disjoint events $\{A_1, A_2, \ldots, A_n\}$,
$\Pr\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} \Pr(A_i)$

3. For any event A, $\Pr(A^c) = 1 - \Pr(A)$

4. For any event A, $0 \le \Pr(A) \le 1$

5. For any two events A and B,
$\Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B)$
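The theorems can be checked by brute force on a small finite sample space. The sketch below assumes a hypothetical example that is not in the slides (one roll of a fair six-sided die, with classical equally-likely probabilities); the events A and B and the helper `pr` are illustrative names only.

```python
from fractions import Fraction

# Hypothetical sample space: one roll of a fair six-sided die.
S = frozenset(range(1, 7))

def pr(event):
    """Classical probability: favourable outcomes / total outcomes."""
    return Fraction(len(event & S), len(S))

A = frozenset({2, 4, 6})   # illustrative event "even number"
B = frozenset({4, 5, 6})   # illustrative event "greater than 3"

empty_ok = pr(frozenset()) == 0                          # Theorem 1
complement_ok = pr(S - A) == 1 - pr(A)                   # Theorem 3
bounds_ok = 0 <= pr(A) <= 1                              # Theorem 4
union_ok = pr(A | B) == pr(A) + pr(B) - pr(A & B)        # Theorem 5

print(empty_ok, complement_ok, bounds_ok, union_ok)
```

Exact fractions are used so that the equalities hold without floating-point rounding.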


Independent Events

Idea of Independence

Two events A and B are independent if the occurrence or
non-occurrence of one of the events has no influence on the
occurrence or non-occurrence of the other event.

Mathematical Definition of Independence

Two events A and B are independent if

$$\Pr(A \cap B) = \Pr(A)\Pr(B)$$


Independent Events: Example of Independence

Are 'Smoking' and 'Lung Cancer' independent?
Suppose

1. $\Pr(\text{Smoking}) = 0.4$,

2. $\Pr(\text{Lung Cancer}) = 0.5$,

3. $\Pr(\text{Lung Cancer} \cap \text{Smoking}) = 0.35$

Clearly, $\Pr(\text{Lung Cancer}) \times \Pr(\text{Smoking}) = 0.5 \times 0.4 = 0.2 \ne
0.35 = \Pr(\text{Lung Cancer} \cap \text{Smoking})$, so the two events are not independent.


Conditional Probability

Conditional probabilities allow us to understand how the probability
of an event A changes after it has been learned that some other
event B has occurred.

> The key concept for thinking about conditional probabilities is
that the occurrence of B reshapes the sample space for
subsequent events.

> That is, we begin with a sample space S.

> A and B are events in S.

> The conditional probability of A given B looks just at the
subset of the sample space in which B occurs.


Conditional Probability

1. The conditional probability of A given B is denoted $\Pr(A \mid B)$.

2. Importantly, according to Bayesian orthodoxy, all probability
distributions are implicitly or explicitly conditioned on the
model.

3. By definition: if A and B are two events such that
$\Pr(B) > 0$, then

$$\Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)}$$


Conditional Probability

Example
1. What is $\Pr(\text{Lung Cancer} \mid \text{Smoking})$?

2. $\Pr(\text{Lung Cancer} \cap \text{Smoking}) = 0.35$

3. $\Pr(\text{Smoking}) = 0.4$

4. Thus, $\Pr(\text{Lung Cancer} \mid \text{Smoking}) = 0.35/0.4 = 0.875$
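The arithmetic of the smoking example can be reproduced in a few lines. This is a minimal sketch using only the three probabilities given in the slides; the function name `conditional` and the variable names are illustrative choices, not part of the original material.

```python
import math

# Probabilities taken from the smoking / lung cancer example.
pr_smoking = 0.4
pr_cancer = 0.5
pr_both = 0.35   # Pr(Lung Cancer and Smoking)

def conditional(pr_a_and_b, pr_b):
    """Pr(A | B) = Pr(A and B) / Pr(B), defined only when Pr(B) > 0."""
    if pr_b <= 0:
        raise ValueError("Pr(B) must be positive")
    return pr_a_and_b / pr_b

pr_cancer_given_smoking = conditional(pr_both, pr_smoking)

# Independence would require Pr(A and B) == Pr(A) * Pr(B).
independent = math.isclose(pr_both, pr_cancer * pr_smoking)

print(pr_cancer_given_smoking, independent)
```

Because the conditional probability (about 0.875) differs from the unconditional Pr(Lung Cancer) = 0.5, learning that a person smokes changes the probability, which is another way of seeing that the events are dependent.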


Properties of Conditional Probability

1. The Conditional Probability for Independent Events: If A
and B are independent then $\Pr(A \mid B) = \Pr(A)$.

2. The Multiplication Rule for Conditional Probabilities: In
an experiment involving two non-independent events A and B,
the probability that both A and B occur can be found in the
following two ways:

$$\Pr(A \cap B) = \Pr(B)\Pr(A \mid B)$$
or
$$\Pr(A \cap B) = \Pr(A)\Pr(B \mid A)$$

3. If the set of events $\{A_1, \ldots, A_n\}$ is a partition of the sample space
S, that is, the $A_i$ are disjoint and $\bigcup_{i=1}^{n} A_i = S$, then

$$\Pr(B) = \sum_{i=1}^{n} \Pr(B \mid A_i)\Pr(A_i)$$


Bayes' Theorem

Let events $A_1, \ldots, A_k$ form a partition of the space S such that
$\Pr(A_j) > 0$ for all j and let B be any event such that $\Pr(B) > 0$.
Then for $i = 1, \ldots, k$:

$$\Pr(A_i \mid B) = \frac{\Pr(B \mid A_i)\Pr(A_i)}{\sum_{j=1}^{k} \Pr(B \mid A_j)\Pr(A_j)}$$


Bayes' Theorem

1. Bayes' Theorem is just a simple rule for computing the
conditional probability of events $A_i$ given B from the
conditional probability of B given each event $A_i$ and the
unconditional probability of each $A_i$.

2. $\Pr(A_i)$ is the prior probability of $A_i$.

3. $\Pr(B \mid A_i)$ is the conditional probability of B given $A_i$.

4. $\sum_{j} \Pr(B \mid A_j)\Pr(A_j)$ is the normalizing constant.

5. $\Pr(A_i \mid B)$ is the posterior probability of $A_i$ given B.
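Bayes' Theorem over a finite partition can be sketched as a short function. The example reuses the smoking numbers from the earlier slides with the partition {Smoking, Not Smoking}; the value Pr(Lung Cancer | Not Smoking) = (0.5 - 0.35)/0.6 = 0.25 is derived here from those numbers and is not stated in the slides. The function name `bayes_posterior` is an illustrative choice.

```python
import math

def bayes_posterior(priors, likelihoods):
    """Return Pr(A_i | B) for every A_i in a partition.

    priors[i]      = Pr(A_i)
    likelihoods[i] = Pr(B | A_i)
    """
    if not math.isclose(sum(priors), 1.0):
        raise ValueError("priors over a partition must sum to 1")
    joint = [lik * p for lik, p in zip(likelihoods, priors)]
    normalizer = sum(joint)   # Pr(B), by the law of total probability
    if normalizer <= 0:
        raise ValueError("Pr(B) must be positive")
    return [j / normalizer for j in joint]

# Partition: A_1 = Smoking, A_2 = Not Smoking; B = Lung Cancer.
priors = [0.4, 0.6]
likelihoods = [0.35 / 0.4, (0.5 - 0.35) / 0.6]   # Pr(B | A_1), Pr(B | A_2)

posterior = bayes_posterior(priors, likelihoods)
print(posterior)   # Pr(Smoking | Lung Cancer), Pr(Not Smoking | Lung Cancer)
```

The first entry is about 0.7, which agrees with the direct calculation Pr(Smoking and Lung Cancer)/Pr(Lung Cancer) = 0.35/0.5.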




A Gateway to all Postgraduate Courses An MHRD project under its National Mission on Education through ICT (NME-ICT) 


Subject: Statistics 


Paper: Statistical Inference 
Module: Bayesian Analysis of One Parameter Model


Development Team


Principal investigator: Dr. Bhaswati Ganguli, Professor, 
Department of Statistics, University of Calcutta 

Paper co-ordinator: Dr. Dipak K Dey, Associate Dean and BOT 
Distinguished Professor, Department of Statistics, 
University of Connecticut 

Content writer: Dr. Sourish Das, Assistant Professor, Chennai 
Mathematical Institute 


Content reviewer: Department of Statistics, University of Calcutta 




Introduction

1. The binomial distribution with a uniform prior: integration tricks

2. Posterior interpretation

3. Binomial distribution with a beta prior

4. Conjugate priors and sufficient statistics


Bayesian Setup

Modeling the Unknown Quantities
From the Bayesian perspective, there are known and unknown
quantities.

> The known quantity is the data, denoted D.

> The unknown quantities are the parameters (e.g. mean,
variance, missing data), denoted $\theta$.

To make inferences about the unknown quantities, we stipulate a
joint probability function that describes how we believe these
quantities behave in conjunction, $p(\theta, D)$.


Bayesian Setup

> Using Bayes' Rule, this joint probability function can be
rearranged to make inference about $\theta$:

$$p(\theta \mid D) = \frac{L(\theta \mid D)\,p(\theta)}{\int L(\theta \mid D)\,p(\theta)\,d\theta}$$

> $L(\theta \mid D)$ is the likelihood function of $\theta$.

> $\int L(\theta \mid D)\,p(\theta)\,d\theta$ is the normalizing constant or prior predictive
distribution.

> It is the normalizing constant because it ensures that the
posterior distribution of $\theta$ integrates to one.


Bayesian Setup

> It is the prior predictive distribution because it is not
conditional on a previous observation of the data-generating
process (prior) and because it is the distribution of an
observable quantity (predictive).

Popular Presentation

The posterior distribution is often presented as
$$p(\theta \mid D) \propto L(\theta \mid D)\,p(\theta)$$
i.e., posterior $\propto$ likelihood $\times$ prior

> Why are we allowed to do this?

> Why might this not be as useful?


Bayesian Analysis: Binomial Distribution

> Suppose $X_1, X_2, \ldots, X_n$ are independent random draws from the
same Bernoulli distribution with parameter $\pi$ (unknown).

> Thus $X_i \sim \text{Bernoulli}(\pi)$ for $i \in \{1, 2, \ldots, n\}$, or equivalently
$Y = \sum_{i=1}^{n} X_i \sim \text{Bin}(n, \pi)$.

> The joint distribution of Y and $\pi$ is the product of the
conditional distribution of Y given $\pi$ and the prior distribution of $\pi$.

> What distribution might be a reasonable choice for the prior
distribution of $\pi$? Why?


Bayesian Analysis: Binomial Distribution

> If $Y \sim \text{Bin}(n, \pi)$, a reasonable prior distribution for $\pi$ must be
bounded between zero and one.

> One option is the uniform distribution $\pi \sim \text{Unif}(0, 1)$:

$$p(\pi \mid Y) \propto {}^{n}C_{y}\,\pi^{y}(1-\pi)^{n-y} \times 1$$

> As it happens, this is a proper posterior density function.

> How can you tell?


Bayesian Analysis: Binomial Distribution

Let $Y \sim \text{Bin}(n, \pi)$ and $\pi \sim \text{Unif}(0, 1)$.

> Then
$$p(\pi \mid Y) \propto {}^{n}C_{y}\,\pi^{y}(1-\pi)^{n-y} \times 1 \propto \pi^{y}(1-\pi)^{n-y}$$

> You cannot just call the posterior a binomial distribution
because you are conditioning on Y and $\pi$ is a random
variable, not the other way around.

> The pdf of the beta distribution, which is known to be proper, is:
$$\text{Beta}(x \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\,x^{\alpha-1}(1-x)^{\beta-1}$$
where $0 < x < 1$ and $\alpha > 0$, $\beta > 0$.
Note that $\Gamma(k)$ is the Gamma function.

> Let $x = \pi$, $\alpha = Y + 1$ and $\beta = n - Y + 1$.

> Thus
$$p(\pi \mid Y = y) = \text{Beta}(y+1,\, n-y+1) = \frac{\Gamma(n+2)}{\Gamma(y+1)\,\Gamma(n-y+1)}\,\pi^{y}(1-\pi)^{n-y}$$

> Note that $\frac{\Gamma(n+2)}{\Gamma(y+1)\,\Gamma(n-y+1)}$ is the normalizing constant that
transforms $\pi^{y}(1-\pi)^{n-y}$ into a proper density.
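The role of the normalizing constant can be checked numerically. The sketch below integrates the kernel $\pi^{y}(1-\pi)^{n-y}$ with a simple midpoint rule and compares the area with the closed form $\Gamma(y+1)\Gamma(n-y+1)/\Gamma(n+2)$. The values n = 24 and y = 17 are illustrative, and the helper names (`kernel`, `integrate_midpoint`) are assumptions of this sketch rather than anything defined in the slides.

```python
import math

def kernel(pi, y, n):
    """Unnormalized posterior under a uniform prior: pi^y (1 - pi)^(n - y)."""
    return pi**y * (1.0 - pi)**(n - y)

def integrate_midpoint(f, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

n, y = 24, 17   # illustrative values

area = integrate_midpoint(lambda p: kernel(p, y, n), 0.0, 1.0)

# Closed form of the same integral, evaluated on the log scale for stability.
closed_form = math.exp(
    math.lgamma(y + 1) + math.lgamma(n - y + 1) - math.lgamma(n + 2)
)

# The normalizing constant is the reciprocal of the kernel's area.
normalizer = 1.0 / closed_form

print(area, closed_form, area * normalizer)
```

If the two areas agree, multiplying the kernel by `normalizer` gives a function that integrates to one, which is what makes the posterior a proper density.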


Application: The Cultural Consensus Model

Description

A researcher examined the level of consensus, denoted $\pi$, among
n = 24 Guatemalan women about whether or not polio (as well as
other diseases) was thought to be contagious. In this case, 17
women said polio was contagious.

> Let $X_i = 1$ if respondent i thought polio was contagious and
$X_i = 0$ otherwise.

> Let $\sum_{i=1}^{24} X_i = Y \sim \text{Bin}(24, \pi)$ and $\pi \sim \text{Unif}(0, 1)$.

> Then $p(\pi \mid Y, n) = \text{Beta}(y+1,\, n-y+1)$.

> Substituting n = 24 and Y = 17, we have the posterior
distribution
$$p(\pi \mid Y, n) = \text{Beta}(18, 8)$$


Application: The Cultural Consensus Model

[Figure: posterior density plot of $\pi$]

Posterior Summaries

The full posterior contains too much information, especially in
multi-parameter models. So, we use summary statistics (e.g.
mean, variance, HDR).

> Methods for generating summary statistics:

> Analytical Solutions: use the well-known analytic solutions for
the mean, variance, etc. of the various posterior distributions.

> Numerical Solutions: use a random number generator to draw
a large number of values from the posterior distribution, then
compute summary statistics from those random draws.


Analytic Summaries of the Posterior

> Analytic summaries are based on standard results from
probability theory.

> Continuing our example, $p(\pi \mid Y, n) = \text{Beta}(18, 8)$.

> If $\theta \sim \text{Beta}(\alpha, \beta)$, then

$$E(\theta) = \frac{\alpha}{\alpha+\beta}, \qquad E(\pi \mid Y, n) = \frac{18}{18+8} \approx 0.69$$

$$\text{Var}(\theta) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}, \qquad \text{Var}(\pi \mid Y, n) = \frac{18 \times 8}{(18+8)^2(18+8+1)} \approx 0.008$$

$$\text{Mode}(\theta) = \frac{\alpha-1}{\alpha+\beta-2}, \qquad \text{Mode}(\pi \mid Y, n) = \frac{18-1}{18+8-2} \approx 0.71$$
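The closed-form summaries of a beta distribution are easy to compute directly. The following is a minimal sketch for the Beta(18, 8) posterior of the example; the function name `beta_summaries` is an illustrative choice, and the mode formula is only applied when both shape parameters exceed one.

```python
def beta_summaries(alpha, beta):
    """Mean, variance and mode of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    var = alpha * beta / (total**2 * (total + 1))
    # The interior mode (alpha - 1) / (alpha + beta - 2) needs alpha, beta > 1.
    mode = (alpha - 1) / (total - 2) if alpha > 1 and beta > 1 else None
    return mean, var, mode

mean, var, mode = beta_summaries(18, 8)
print(round(mean, 4), round(var, 4), round(mode, 4))
```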


Numerical Summaries of the Posterior

> To create numerical summaries from the posterior, you need a
random number generator.

> To summarize $p(\pi \mid Y, n) = \text{Beta}(18, 8)$:

> Draw a large number of random samples from the Beta(18, 8)
distribution.

> Calculate the sample statistics from that set of random
samples.


Numerical Summaries of the Posterior

  Min.  1st Qu.  Median    Mean  3rd Qu.    Max.
0.4045   0.6311  0.6916  0.6903   0.7539  0.9065

[Figure: histogram of the simulated draws of $\pi$]
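The two numerical steps (draw, then summarize) can be sketched with Python's standard library. The seed and the number of draws below are arbitrary assumptions of this sketch, so the figures it produces will be close to, but not identical to, the summary shown on the slide.

```python
import random
import statistics

rng = random.Random(2024)   # fixed seed, arbitrary choice, for reproducibility

# Step 1: draw a large number of samples from the Beta(18, 8) posterior.
draws = [rng.betavariate(18, 8) for _ in range(100_000)]

# Step 2: compute sample statistics from the draws.
q1, median, q3 = statistics.quantiles(draws, n=4)
summary = {
    "min": min(draws),
    "1st quartile": q1,
    "median": median,
    "mean": statistics.fmean(draws),
    "3rd quartile": q3,
    "max": max(draws),
}

for name, value in summary.items():
    print(f"{name:>12}: {value:.4f}")
```

With many draws the sample mean should sit close to the analytic mean 18/26, which gives a quick sanity check on the simulation.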


Credible Intervals or Highest Posterior Density Regions

Highest Density Regions (HDRs) are intervals containing a
specified posterior probability. The figure below plots the 95%
highest posterior density region.

[Figure: posterior density with the 95% highest posterior density region]
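One simple way to approximate a highest density region is a grid search: evaluate the density on a fine grid, keep the grid cells with the highest density until they hold the required probability, and report the range they span. The sketch below does this for Beta(18, 8). It assumes a unimodal density (so the kept cells form a single interval), and the function names and grid size are illustrative choices, not the method used to draw the slide's figure.

```python
import math

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x in (0, 1), computed on the log scale."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def hdr_on_grid(a, b, mass=0.95, steps=20_000):
    """Approximate the highest density region of a unimodal Beta(a, b)."""
    h = 1.0 / steps
    grid = [(i + 0.5) * h for i in range(steps)]   # cell midpoints, never 0 or 1
    dens = [beta_pdf(x, a, b) for x in grid]
    # Visit cells from highest to lowest density, accumulating probability.
    order = sorted(range(steps), key=lambda i: dens[i], reverse=True)
    covered, kept = 0.0, []
    for i in order:
        kept.append(i)
        covered += dens[i] * h
        if covered >= mass:
            break
    return grid[min(kept)], grid[max(kept)], covered

lower, upper, covered = hdr_on_grid(18, 8)
print(f"95% HDR is roughly ({lower:.3f}, {upper:.3f}), mass {covered:.4f}")
```

Because cells are added in order of decreasing density, the resulting interval is the shortest one (on this grid) that holds the requested posterior probability.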


Confidence Intervals vs.
Bayesian Credible Intervals

Bayesian credible interval

A Bayesian credible interval is an interval that contains the true value
of $\theta$ with a stated posterior probability. Technically,

$$P(\theta \in \text{Interval} \mid \text{Data}) = \alpha$$

Note that here the probability statement is direct.

Frequentist confidence interval

The frequentist confidence interval is the region of the sampling
distribution for $\theta$ such that, given the observed data, one would
expect $100(1-\alpha)$ percent of the future estimates of $\theta$ to be outside
that interval. Note that here the understanding of probability is
implicit. It is not a direct probability statement.


Confidence Intervals vs.
Bayesian Credible Intervals

> But often the results appear similar.

> If Bayesians use "non-informative priors" and there is a large
number of observations (often several dozen will do), HDRs
and frequentist confidence intervals will coincide numerically.

> We will talk more about this when we cover the great p-value
debate, but this is only a coincidence.

> The interpretations of the two quantities are entirely different.


Returning to the Binomial Distribution

> If $Y \sim \text{Bin}(n, \pi)$, the uniform prior is just one of an infinite
number of possible prior distributions.

> What other distributions could we use?

> A reasonable alternative to the Unif(0, 1) distribution is the
beta distribution.

> Can you show that Beta(1, 1) is a Unif(0, 1) distribution?
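One way to see that Beta(1, 1) coincides with Unif(0, 1) is to substitute $\alpha = \beta = 1$ into the beta pdf. The short derivation below is a sketch that uses only the fact that $\Gamma(k) = (k-1)!$ for positive integers k:

```latex
\[
\text{Beta}(x \mid 1, 1)
  = \frac{\Gamma(1+1)}{\Gamma(1)\,\Gamma(1)}\, x^{1-1} (1-x)^{1-1}
  = \frac{1!}{0!\,0!} \cdot 1 \cdot 1
  = 1, \qquad 0 < x < 1 .
\]
```

A density that is constant and equal to one on (0, 1) is exactly the Unif(0, 1) density.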


Prior Consequences
Plots of 4 Different Beta Distributions

[Figure: four panels showing the densities of Beta(5,5), Beta(3,10), Beta(10,3) and Beta(100,30), each plotted against $\pi$ on the interval from 0.0 to 1.0]


Binomial Distribution
with Beta Prior

> If $Y \sim \text{Bin}(n, \pi)$ and $\pi \sim \text{Beta}(\alpha, \beta)$

> The posterior distribution:

$$p(\pi \mid Y, n) = \frac{{}^{n}C_{y}\,\pi^{y}(1-\pi)^{n-y} \times \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\pi^{\alpha-1}(1-\pi)^{\beta-1}}{\int_0^1 {}^{n}C_{y}\,\pi^{y}(1-\pi)^{n-y} \times \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\pi^{\alpha-1}(1-\pi)^{\beta-1}\,d\pi}$$

Note

The denominator contains a very complicated integral. This
particular integral can be solved, but we will pretend that it is
a difficult integral and use a standard trick from the Bayesian
toolbox to solve this problem.


Binomial Distribution
with Beta Prior

Trick:
> Find some multiplicative constant c such that $f(y) \times c = 1$,
i.e., try to transform f(y) into the integral of a well-known pdf.

> Multiply by c and $c^{-1}$.

> Since $c \times f(y) = 1$, we have $f(y) = c^{-1}$, so the original numerator
divided by $c^{-1}$ is the posterior distribution.


The posterior predictive distribution


\[
f(y) = \int_{0}^{1} {}^{n}C_{y}\,\pi^{y}(1-\pi)^{n-y} \times \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\pi^{\alpha-1}(1-\pi)^{\beta-1}\,d\pi
\]


> It can be expressed as:


\[
f(y) = \frac{\Gamma(n+1)\,\Gamma(\alpha+\beta)}{\Gamma(y+1)\,\Gamma(n-y+1)\,\Gamma(\alpha)\,\Gamma(\beta)} \int_{0}^{1} \pi^{y+\alpha-1}(1-\pi)^{n+\beta-y-1}\,d\pi
\]


> $\pi^{y+\alpha-1}(1-\pi)^{n+\beta-y-1}$ is the kernel of the beta
distribution, so the integral equals $\Gamma(y+\alpha)\Gamma(n+\beta-y)/\Gamma(\alpha+n+\beta)$


\[
f(y) = \frac{\Gamma(n+1)\,\Gamma(\alpha+\beta)}{\Gamma(y+1)\,\Gamma(n-y+1)\,\Gamma(\alpha)\,\Gamma(\beta)} \cdot \frac{\Gamma(y+\alpha)\,\Gamma(n+\beta-y)}{\Gamma(\alpha+n+\beta)}
\]


This is called the beta-binomial distribution.
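
The closed form above can be checked numerically. The following is a minimal sketch, not part of the original slides: the function name `beta_binomial_pmf` and the values n = 24, alpha = beta = 5 are illustrative assumptions. It evaluates f(y) through log-gamma functions and confirms that the probabilities sum to one and that the mean equals n·alpha/(alpha + beta).

```python
from math import exp, lgamma


def beta_binomial_pmf(y, n, alpha, beta):
    """Marginal f(y) when Y ~ Bin(n, pi) and pi ~ Beta(alpha, beta).

    Works on the log scale so that large gamma values do not overflow.
    """
    log_f = (
        lgamma(n + 1) + lgamma(alpha + beta)
        - lgamma(y + 1) - lgamma(n - y + 1) - lgamma(alpha) - lgamma(beta)
        + lgamma(y + alpha) + lgamma(n + beta - y)
        - lgamma(alpha + n + beta)
    )
    return exp(log_f)


# Illustrative (assumed) values, not taken from this slide.
n, alpha, beta = 24, 5.0, 5.0
pmf = [beta_binomial_pmf(y, n, alpha, beta) for y in range(n + 1)]
total = sum(pmf)                                # should be 1
mean = sum(y * p for y, p in enumerate(pmf))    # should be n*alpha/(alpha+beta)
print(total, mean)
```

Working with `lgamma` rather than `gamma` is a design choice for numerical stability only; the formula is the one displayed above.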


The Posterior of Binomial Model with Beta Prior


> As the marginal distribution is:


\[
f(y) = \frac{\Gamma(n+1)\,\Gamma(\alpha+\beta)}{\Gamma(y+1)\,\Gamma(n-y+1)\,\Gamma(\alpha)\,\Gamma(\beta)} \cdot \frac{\Gamma(y+\alpha)\,\Gamma(n+\beta-y)}{\Gamma(\alpha+n+\beta)}
\]


> The posterior distribution is:


\[
f(\pi \mid y) = \frac{f(y \mid \pi)\, f(\pi)}{f(y)}
= \frac{\frac{\Gamma(n+1)}{\Gamma(y+1)\Gamma(n-y+1)}\,\pi^{y}(1-\pi)^{n-y} \times \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,\pi^{\alpha-1}(1-\pi)^{\beta-1}}{f(y)}
\]


> Simplifying the above expression:


\[
f(\pi \mid y) = \frac{\Gamma(\alpha+n+\beta)}{\Gamma(y+\alpha)\,\Gamma(n+\beta-y)}\,\pi^{y+\alpha-1}(1-\pi)^{n+\beta-y-1}
\]


> This is the $\mathrm{Beta}(y+\alpha,\ n-y+\beta)$ distribution.


The Posterior of Binomial Model with Beta Prior


Note

You can see that the posterior distribution belongs to the same
family as the prior distribution, with its parameters updated by the
new data. In general, when this happens we say the prior is conjugate.


Application 


Let us continue the previous example. Recall that 17 of 24
women say polio is contagious (so y = 17 and n = 24, where y is a
realization from a binomial distribution) and you use a Beta(5, 5)
prior; the posterior distribution is
Beta(17 + 5, 24 - 17 + 5) = Beta(22, 12).
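
The update in this example is mechanical enough to write down in a few lines. This is a small sketch under the conjugate result derived above; the helper name `beta_posterior` is my own and not from the slides. It also compares the posterior mean with the prior mean and the sample proportion.

```python
def beta_posterior(y, n, alpha, beta):
    """Posterior Beta parameters for y successes in n trials under a Beta(alpha, beta) prior."""
    return y + alpha, n - y + beta


a_post, b_post = beta_posterior(y=17, n=24, alpha=5, beta=5)

prior_mean = 5 / (5 + 5)                  # mean of Beta(5, 5)
post_mean = a_post / (a_post + b_post)    # mean of Beta(22, 12)
mle = 17 / 24                             # sample proportion

print(a_post, b_post)                     # 22 12
print(prior_mean, post_mean, mle)
```

The posterior mean 22/34 lies between the prior mean 0.5 and the sample proportion 17/24, which is the usual shrinkage behaviour of a conjugate update.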


[Figure: prior, likelihood and posterior densities for the application above.]


Consequence of Different Priors


[Figures: prior, likelihood and posterior densities for the same data under several Beta priors, including Prior(5, 5), Prior(10, 3) and Prior(100, 30).]




A Gateway to all Postgraduate Courses An MHRD project under its National Mission on Education through ICT (NME-ICT) 


Subject: Statistics 


Paper: Statistical Inference 
Module: Bayesian Analysis of Normal 
Distribution - Part 1 


Development Team


Principal investigator: Dr. Bhaswati Ganguli, Professor, 
Department of Statistics, University of Calcutta 

Paper co-ordinator: Dr. Dipak K Dey, Associate Dean and BOT 
Distinguished Professor, Department of Statistics, 
University of Connecticut 

Content writer: Dr. Sourish Das, Assistant Professor, Chennai 
Mathematical Institute 


Content reviewer: Department of Statistics, University of Calcutta 


Introduction


Bayesian Analysis of Mean 


Bayesian estimation of the mean when variables are normally 
distributed with known variance. 


Benchmark


> Benchmark: the maximum likelihood estimate for the normal
distribution with known variance


> Let $y_i \overset{iid}{\sim} N(\mu, 1)$, $i = 1, 2, 3, \ldots, n$


> The log-likelihood function:


\[
\log L(\mu \mid y) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^{n}(y_i - \mu)^2
\]


> To find the MLE, take the derivative of the log-likelihood with
respect to $\mu$ and set it to 0 to solve for $\mu$, i.e.,


\[
\frac{\partial}{\partial\mu}\log L(\mu \mid y) = \sum_{i=1}^{n}(y_i - \mu) = 0
\]


> MLE: $\hat{\mu} = \sum_{i=1}^{n} y_i / n$
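
The benchmark can be verified on a toy sample. The sketch below uses made-up data (the list `ys` is an assumption for illustration only) and checks that the score, the derivative of the log-likelihood, vanishes at the sample mean and that nearby values of mu give a lower log-likelihood.

```python
import math


def log_likelihood(mu, ys):
    """Log-likelihood of N(mu, 1) for the observations ys."""
    n = len(ys)
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * sum((y - mu) ** 2 for y in ys)


def score(mu, ys):
    """Derivative of the log-likelihood with respect to mu."""
    return sum(y - mu for y in ys)


ys = [1.2, -0.4, 0.7, 2.1, 0.3]   # made-up data for illustration
mu_hat = sum(ys) / len(ys)         # the MLE is the sample mean

print(mu_hat, score(mu_hat, ys))
print(log_likelihood(mu_hat, ys) > log_likelihood(mu_hat + 0.1, ys))
```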


Normal Model with known Variance


> Let us consider the simple case when the sample size n = 1.
> Suppose $y \sim N(\mu, \sigma_0^2)$
> where $\sigma_0^2$ is a known constant
> $\mu$ is unknown
> What would be a good prior to choose for $\mu$?
> What properties should the prior have?


A reasonable choice for the prior : Normal


> Why is the normal distribution a good choice?


> Because the normal distribution is quite flexible...
   > The prior mean can take any value along the real line
   > The prior variance can be chosen so that our prior beliefs are
     (or are not) influential determinants of the posterior mean and
     variance.


> Because it may be a close representation of our prior beliefs,
   > if our prior beliefs are symmetric and unimodal about some
     point along the real line.


> Because it is conjugate, so the posterior will be normal.


Normal Prior


> Let $y \sim N(\mu, \sigma_0^2)$ and $\mu \sim N(m, s^2)$


\[
p(\mu \mid y) \propto f(y \mid \mu)\,p(\mu)
\propto (2\pi\sigma_0^2)^{-1/2}\exp\left\{-\frac{1}{2\sigma_0^2}(y-\mu)^2\right\}
\times (2\pi s^2)^{-1/2}\exp\left\{-\frac{1}{2s^2}(\mu-m)^2\right\}
\]


> Drop multiplicative constants


\[
p(\mu \mid y) \propto \exp\left\{-\frac{1}{2\sigma_0^2}(y-\mu)^2 - \frac{1}{2s^2}(\mu-m)^2\right\}
\]


Expanding the exponent:


\[
-\frac{1}{2\sigma_0^2}(y^2 - 2y\mu + \mu^2) - \frac{1}{2s^2}(\mu^2 - 2m\mu + m^2)
= -\frac{1}{2s^2\sigma_0^2}\left[s^2y^2 - 2s^2y\mu + s^2\mu^2 + \sigma_0^2\mu^2 - 2\sigma_0^2 m\mu + \sigma_0^2 m^2\right]
\]


\[
p(\mu \mid y) \propto \exp\left\{-\frac{1}{2s^2\sigma_0^2}\left[\mu^2(\sigma_0^2 + s^2) - 2\mu(\sigma_0^2 m + s^2 y) + (\sigma_0^2 m^2 + s^2 y^2)\right]\right\}
\]


The third term is free from $\mu$, so it is part of the normalizing
constant. We can drop it.


\[
p(\mu \mid y) \propto \exp\left\{-\frac{1}{2s^2\sigma_0^2}\left[\mu^2(\sigma_0^2 + s^2) - 2\mu(\sigma_0^2 m + s^2 y)\right]\right\}
\]


Divide the numerator and the denominator by $s^2\sigma_0^2$:


\[
p(\mu \mid y) \propto \exp\left\{-\frac{1}{2}\left[\mu^2\left(\frac{1}{s^2} + \frac{1}{\sigma_0^2}\right) - 2\mu\left(\frac{m}{s^2} + \frac{y}{\sigma_0^2}\right)\right]\right\}
\]


Multiply and divide by $(s^{-2} + \sigma_0^{-2})^{-1}$:


\[
p(\mu \mid y) \propto \exp\left\{-\frac{1}{2}\left(\frac{1}{s^2} + \frac{1}{\sigma_0^2}\right)\left[\mu^2 - 2\mu\,\frac{\frac{m}{s^2} + \frac{y}{\sigma_0^2}}{\frac{1}{s^2} + \frac{1}{\sigma_0^2}}\right]\right\}
\]


Completing the square in $\mu$ (the added term is again free from $\mu$):


\[
p(\mu \mid y) \propto \exp\left\{-\frac{1}{2}\left(\frac{1}{s^2} + \frac{1}{\sigma_0^2}\right)\left[\mu - \frac{\frac{m}{s^2} + \frac{y}{\sigma_0^2}}{\frac{1}{s^2} + \frac{1}{\sigma_0^2}}\right]^2\right\}
\]


Let

\[
\hat{\mu} = \frac{\frac{m}{s^2} + \frac{y}{\sigma_0^2}}{\frac{1}{s^2} + \frac{1}{\sigma_0^2}},
\qquad
\hat{\sigma}^2 = \left(\frac{1}{s^2} + \frac{1}{\sigma_0^2}\right)^{-1},
\qquad\text{so that}\qquad
p(\mu \mid y) \propto \exp\left\{-\frac{1}{2\hat{\sigma}^2}(\mu - \hat{\mu})^2\right\}
\]


> Thus the posterior distribution is

\[
\mu \mid y \sim N(\hat{\mu}, \hat{\sigma}^2)
\]

with posterior mean

\[
E(\mu \mid y) = \hat{\mu} = \frac{\frac{m}{s^2} + \frac{y}{\sigma_0^2}}{\frac{1}{s^2} + \frac{1}{\sigma_0^2}}
\]

and posterior variance

\[
Var(\mu \mid y) = \hat{\sigma}^2 = \left(\frac{1}{s^2} + \frac{1}{\sigma_0^2}\right)^{-1}
\]


Discussion


> The posterior mean is the weighted average of the data and the
prior mean, i.e.,


\[
E(\mu \mid y) = \hat{\mu} = \frac{\frac{m}{s^2} + \frac{y}{\sigma_0^2}}{\frac{1}{s^2} + \frac{1}{\sigma_0^2}} = w\,m + (1-w)\,y
\]

where

\[
w = \frac{\frac{1}{s^2}}{\frac{1}{s^2} + \frac{1}{\sigma_0^2}} = \frac{\sigma_0^2}{\sigma_0^2 + s^2}
\]
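
To make the weighted-average reading concrete, here is a minimal sketch for a single observation. The function name `normal_posterior_one_obs` and the numbers y = 3, sigma0² = 4, m = 0, s² = 1 are hypothetical and chosen only so the arithmetic is easy to follow. It computes the posterior from the precision form and checks that it matches w·m + (1 - w)·y.

```python
def normal_posterior_one_obs(y, sigma0_sq, m, s_sq):
    """Posterior mean and variance of mu for one y ~ N(mu, sigma0_sq), mu ~ N(m, s_sq)."""
    precision = 1.0 / s_sq + 1.0 / sigma0_sq      # posterior precision
    mean = (m / s_sq + y / sigma0_sq) / precision
    return mean, 1.0 / precision


# Hypothetical inputs, for illustration only.
y, sigma0_sq, m, s_sq = 3.0, 4.0, 0.0, 1.0
mu_hat, var_hat = normal_posterior_one_obs(y, sigma0_sq, m, s_sq)

w = sigma0_sq / (sigma0_sq + s_sq)    # weight on the prior mean
weighted = w * m + (1 - w) * y        # same posterior mean, written as a weighted average

print(mu_hat, var_hat, w, weighted)
```

With these inputs the prior is more precise than the single observation, so most of the weight (w = 0.8) falls on the prior mean.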


Normal prior


With n observations


If we have more than one observation of the random variable y
from the normal distribution with unknown mean and known
variance, it is straightforward to extend this result to show that:


\[
\mu \mid y_1, \ldots, y_n \sim N(\hat{\mu}, \hat{\sigma}^2)
\]

where

\[
\hat{\mu} = E(\mu \mid y_1, \ldots, y_n) = \frac{\frac{m}{s^2} + \frac{n\bar{y}}{\sigma_0^2}}{\frac{1}{s^2} + \frac{n}{\sigma_0^2}}
\]

and

\[
\hat{\sigma}^2 = Var(\mu \mid y_1, \ldots, y_n) = \left(\frac{1}{s^2} + \frac{n}{\sigma_0^2}\right)^{-1}
\]


Normal prior


> The posterior mean:

\[
E(\mu \mid y_1, y_2, \ldots, y_n) = w\,m + (1-w)\,\bar{y}
\]

where

\[
w = \frac{\frac{1}{s^2}}{\frac{1}{s^2} + \frac{n}{\sigma_0^2}} = \frac{\sigma_0^2}{\sigma_0^2 + n s^2}
\qquad\text{and}\qquad
1 - w = \frac{n s^2}{\sigma_0^2 + n s^2}
\]


Note


Clearly, as $n \to \infty$, $w \to 0$; therefore the posterior mean tends
towards the MLE, i.e., the effect of the prior washes away as the
sample size becomes larger.
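
The note about the prior washing away can be seen numerically. In this sketch the function name `prior_weight` and the values sigma0² = 169 and s² = 10 are assumptions for illustration. It tabulates the weight w on the prior mean for growing n and cross-checks the two equivalent forms of w.

```python
def prior_weight(n, sigma0_sq, s_sq):
    """Weight w placed on the prior mean m after n observations."""
    return sigma0_sq / (sigma0_sq + n * s_sq)


sigma0_sq, s_sq = 169.0, 10.0     # illustrative (assumed) variances
sample_sizes = (1, 10, 100, 1000)
weights = [prior_weight(n, sigma0_sq, s_sq) for n in sample_sizes]

# The same weight written in precision form: (1/s^2) / (1/s^2 + n/sigma0^2).
weights_precision_form = [
    (1 / s_sq) / (1 / s_sq + n / sigma0_sq) for n in sample_sizes
]

for n, w in zip(sample_sizes, weights):
    print(n, round(w, 4))
```

The weights fall steadily towards zero, so the posterior mean moves from the prior mean towards the sample mean as n grows.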


Application


What is the average blood pressure reading of a sub-population of
20 adult men?


> Let $y_i$ denote the blood pressure of individual $i$ and assume
that $y_i$ is normally distributed.


> Based on national studies, we "know" that the standard
deviation of y is 13. We will use this information in our favour.


> Let $\mu$ denote the average blood pressure of the group.


> Thus $y_i \sim N(\mu, 13^2)$


Application


> $y_i \sim N(\mu, 13^2)$ and $\mu \sim N(m, s^2)$


> Why might it be reasonable to choose a normal prior for $\mu$?
Why might it be useful?


> What are the problems with using a normal random variable
to describe the blood pressure data?


> What values should we choose for $m$ and $s^2$?


> What is the posterior distribution of $\mu$?


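Application


> $y_i \sim N(\mu, 13^2)$ and $\mu \sim N(m = 0, s^2 = 1000)$, $\bar{y} = 128$ and
$n = 20$


> What is the posterior distribution of $\mu$?

\[
\mu \mid y_1, \ldots, y_n \sim N(\hat{\mu}, \hat{\sigma}^2)
\]

where

\[
\hat{\mu} = E(\mu \mid y_1, \ldots, y_n) = \frac{\frac{0}{1000} + \frac{20(128)}{13^2}}{\frac{1}{1000} + \frac{20}{13^2}} = 126.93
\]

and

\[
\hat{\sigma}^2 = Var(\mu \mid y_1, \ldots, y_n) = \left(\frac{1}{1000} + \frac{20}{13^2}\right)^{-1} = 8.38
\]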
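
The two numbers in this application follow directly from the n-observation formulas. A short sketch (the function name `normal_posterior` is my own; the inputs are the ones quoted on the slide) reproduces them:

```python
def normal_posterior(ybar, n, sigma0_sq, m, s_sq):
    """Posterior mean and variance of mu given n observations with known variance sigma0_sq."""
    precision = 1.0 / s_sq + n / sigma0_sq
    mean = (m / s_sq + n * ybar / sigma0_sq) / precision
    return mean, 1.0 / precision


mu_hat, var_hat = normal_posterior(ybar=128.0, n=20, sigma0_sq=13.0 ** 2, m=0.0, s_sq=1000.0)
print(round(mu_hat, 2), round(var_hat, 2))   # 126.93 8.38
```

Even a prior centred at 0 pulls the posterior mean only about one unit below the sample mean of 128, because its variance of 1000 is large relative to 13²/20.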


[Figure: likelihood and posterior densities for the application above; horizontal axis mu, from 120 to 140.]




Subject: Statistics 


Paper: Statistical Inference 
Module: Bayesian Analysis of Normal 
Distribution - Part 2 


Introduction


Bayesian Analysis of Mean 
Bayesian estimation of the mean when variables are normally
distributed and the variance is also unknown.


> In this module, we will consider a non-informative improper
prior for the unknown mean and variance.


Normal Model with unknown Variance


> Suppose $y_i \sim N(\mu, \sigma^2)$
> where $\sigma^2$ and $\mu$ are unknown random variables
> This is the first example of a multi-parameter model.


> The Bayesian setup still looks familiar:


\[
p(\mu, \sigma^2 \mid y_1, \ldots, y_n) \propto p(y \mid \mu, \sigma^2)\,p(\mu, \sigma^2)
\]


> We would like to make inferences about the marginal
distributions $p(\mu \mid y)$ and $p(\sigma^2 \mid y)$, where $y = (y_1, \ldots, y_n)$, rather
than the joint distribution $p(\mu, \sigma^2 \mid y)$.


> Ultimately we would like to find:


\[
p(\mu \mid y) = \int_{0}^{\infty} p(\mu, \sigma^2 \mid y)\,d\sigma^2
\]


> Note that this equation can be presented as:


\[
p(\mu \mid y) = \int_{0}^{\infty} p(\mu \mid \sigma^2, y)\,p(\sigma^2 \mid y)\,d\sigma^2
\]




Different Approach to Choose Priors


Classical Bayesians


The prior is a necessary evil. Choose priors that interject the least
information possible.


Modern Parametric Bayesians


The prior is a useful convenience. Choose prior distributions with
desirable properties (e.g. conjugacy). Given a distributional choice,
prior parameters are chosen to interject the least information
possible.


Subjective Bayesians


The prior is a summary of old beliefs. Choose prior distributions
based on previous knowledge - either the results of earlier studies
or non-scientific opinion.


The Classical Bayesian: normal model
with unknown mean and variance


> Let $y \sim N(\mu, \sigma^2)$, where $\mu$ and $\sigma^2$ are both unknown
random variables. What prior should we choose?


> What prior distribution would you choose to represent the
absence of any knowledge in this instance?


> What if we assumed that the two parameters were
independent, so $p(\mu, \sigma^2) = p(\mu)\,p(\sigma^2)$?


> If $p(\mu, \sigma^2) = p(\mu)\,p(\sigma^2)$, one option would be to assume
uniform prior distributions for both parameters (uniform in $\mu$ and
in $\log\sigma^2$). Thus

\[
p(\mu) \propto c \quad\text{for } -\infty < \mu < \infty
\]

\[
p(\sigma^2) \propto \frac{1}{\sigma^2} \quad\text{for } 0 < \sigma^2 < \infty
\]


> And the joint density would be: $p(\mu, \sigma^2) \propto \frac{1}{\sigma^2}$


> Are these distributions proper?
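
One way to approach the question just posed is to integrate each prior over its range. The short calculation below is my own worked check rather than part of the slides:

```latex
\int_{-\infty}^{\infty} c \, d\mu = \infty \quad (c > 0),
\qquad
\int_{0}^{\infty} \frac{1}{\sigma^{2}} \, d\sigma^{2}
  = \Big[ \log \sigma^{2} \Big]_{0}^{\infty} = \infty .
```

Neither integral is finite, so neither prior can be normalised to a probability density; both are improper.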


The Classical Bayesian: normal model
with unknown mean and variance


> Let $y_i \sim N(\mu, \sigma^2)$ where $i = 1, 2, \ldots, n$ and $p(\mu, \sigma^2) \propto \frac{1}{\sigma^2}$


> The posterior distribution is:


\[
p(\mu, \sigma^2 \mid y) \propto p(y \mid \mu, \sigma^2) \times p(\mu, \sigma^2)
\propto \left(\frac{1}{\sigma^2}\right)^{n/2+1} \exp\left\{-\frac{1}{2\sigma^2}\left[(n-1)s^2 + n(\bar{y} - \mu)^2\right]\right\}
\]


where $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$ and $y = (y_1, \ldots, y_n)$


> It can be shown that the conditional posterior distribution is


\[
\mu \mid \sigma^2, y \sim N(\bar{y}, \sigma^2/n)
\]


The Classical Bayesian: normal model
with unknown mean and variance


> The marginal posterior distribution of $\mu$ is


\[
p(\mu \mid y) = \int_{0}^{\infty} p(\mu, \sigma^2 \mid y)\,d\sigma^2
\]


> It can be shown that


\[
\mu \mid y \sim t_{n-1}(\bar{y}, s^2/n)
\]


> Or more conveniently,


\[
\frac{\mu - \bar{y}}{s/\sqrt{n}} \,\Big|\, y \sim t_{n-1}
\]


The Classical Bayesian: normal model
with unknown mean and variance


> The marginal posterior distribution of $\sigma^2$ is


\[
p(\sigma^2 \mid y) \propto \int_{-\infty}^{\infty} \left(\frac{1}{\sigma^2}\right)^{n/2+1} \exp\left\{-\frac{1}{2\sigma^2}\left[(n-1)s^2 + n(\bar{y} - \mu)^2\right]\right\} d\mu
\]


> It can be shown that it follows a scaled inverse-$\chi^2$ distribution


\[
\sigma^2 \mid y \sim \text{Inv-}\chi^2(n-1, s^2)
\]


> Or more conveniently,


\[
\sigma^2 \mid y \sim \text{Inv-Gamma}\left((n-1)/2,\ (n-1)s^2/2\right)
\]


The Classical Bayesian: normal model
with unknown mean and variance


> Big point to note here:


> Though the prior is flat and improper, the posterior is a proper
probability distribution.


> Hence proper statistical inference can be drawn from it.


Two Different methods to sample
from the posterior in this case


Method 1


Sample directly from each of the two marginal distributions.


Method 2
Two stages:


1. Sample a value from the marginal distribution of $\sigma^2 \mid y$


2. Sample a value from the conditional distribution of $\mu \mid \sigma^2, y$
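
Method 2 can be sketched with the standard library alone. This is an illustration under the results stated earlier (sigma² | y is Inv-Gamma((n-1)/2, (n-1)s²/2) and mu | sigma², y is N(ybar, sigma²/n)). The function name `sample_two_stage` and the summary statistics ybar = 10, s² = 4, n = 15 are hypothetical, and the slides do not say which software produced their own draws.

```python
import math
import random


def sample_two_stage(ybar, s_sq, n, draws, rng):
    """Draw (mu, sigma^2) pairs from the joint posterior under p(mu, sigma^2) ∝ 1/sigma^2."""
    pairs = []
    for _ in range(draws):
        # Stage 1: sigma^2 | y ~ Inv-Gamma((n-1)/2, (n-1)s^2/2), obtained as the
        # reciprocal of a Gamma draw with the same shape and scale 2/((n-1)s^2).
        gamma_draw = rng.gammavariate((n - 1) / 2.0, 2.0 / ((n - 1) * s_sq))
        sigma_sq = 1.0 / gamma_draw
        # Stage 2: mu | sigma^2, y ~ N(ybar, sigma^2 / n).
        mu = rng.gauss(ybar, math.sqrt(sigma_sq / n))
        pairs.append((mu, sigma_sq))
    return pairs


# Hypothetical summary statistics, for illustration only.
ybar, s_sq, n = 10.0, 4.0, 15
rng = random.Random(0)                      # fixed seed so the run is repeatable
pairs = sample_two_stage(ybar, s_sq, n, draws=20000, rng=rng)

mean_mu = sum(mu for mu, _ in pairs) / len(pairs)
mean_sigma_sq = sum(v for _, v in pairs) / len(pairs)
print(mean_mu, mean_sigma_sq)
```

The mu values kept from the pairs are draws from the marginal p(mu | y), so the two-stage scheme yields both marginals at once without needing the t form explicitly.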


Application


What is the average blood pressure reading of a sub-population of
20 adult men?


> Let $y_i$ denote the blood pressure of individual $i$ and assume
that $y_i$ is normally distributed.


> Let $\mu$ denote the average blood pressure of the group.


> Thus $y_i \sim N(\mu, \sigma^2)$




Application


> The sample average is $\bar{y} = 128$ and the sample sd is $s = 7.67$


> The marginal distribution is $\mu \mid y \sim t_{n-1}(\bar{y}, s^2/n)$


> Analytical Results, by the properties of the t-distribution:


\[
E(\mu \mid y) = \bar{y} = 128 \quad\text{and}\quad Var(\mu \mid y) = \frac{n-1}{n-3}\cdot\frac{s^2}{n} = 3.27
\]


> Numerical Results:

Summary of mu :
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  121.3   126.8   128.0   128.0   129.2   135.9

Variance of mu :
[1] 3.285445

Sd of mu :
[1] 1.81258

95% CI of mu :
    2.5%    97.5%
124.3952 131.5632


Application


> The sample average is $\bar{y} = 128$ and the sample sd is $s = 7.67$


> The marginal distribution of $\sigma^2$ is $\sigma^2 \mid y \sim \text{Inv-}\chi^2(n-1, s^2)$


> Analytical Result, by the properties of the scaled inverse-$\chi^2$
distribution:


\[
E(\sigma^2 \mid y) = s^2\left(\frac{n-1}{n-3}\right) = 64.9
\]


> Numerical Result:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  18.09   49.04   60.49   65.36   76.35  362.80


Application


[Figure: plot of posterior densities from method 1, marginal distribution of mean; horizontal axis from 120 to 135.]


[Figure: plot of posterior densities from method 1, marginal distribution of variance.]




Comparing Analytical Results
with Numerical Method 2


Numerical Method 2
Two stages:


1. Sample a value from the marginal distribution of $\sigma^2 \mid y$


2. Sample a value from the conditional distribution of $\mu \mid \sigma^2, y$


Summary of mu :
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  118.7   126.8   128.0   128.0   129.2   136.4

Variance of mu :
[1] 3.147524

Sd of mu :
[1] 1.774126

95% CI of mu :
    2.5%    97.5%
124.5105 131.5384


Comparison between Analytical
and Numerical Results


                      Analytical   Numerical Method I   Numerical Method II
E(mu | y)             128          128.0                128.0
V(mu | y)             3.27         3.29                 3.15
E(sigma^2 | y)        64.9         65.4                 65.4
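
A comparison of this kind can be re-run with Method 1, sampling each marginal directly. The sketch below is my own and uses only Python's standard library; the seed, the number of draws and the helper names are assumptions, so the simulated figures will not match the table digit for digit. With s rounded to 7.67 the closed forms evaluate to roughly 3.29 and 65.7, close to (though not identical with) the analytical column above.

```python
import math
import random

n, ybar, s = 20, 128.0, 7.67    # summary statistics quoted on the slides
nu = n - 1                      # degrees of freedom
rng = random.Random(2024)       # fixed seed so the run is repeatable
draws = 50000


def chi2(df):
    """One chi-square(df) draw, using chi-square(df) = Gamma(shape=df/2, scale=2)."""
    return rng.gammavariate(df / 2.0, 2.0)


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


# mu | y ~ t_{n-1}(ybar, s^2/n): shift and scale a standard t draw.
mu = [
    ybar + (s / math.sqrt(n)) * rng.gauss(0.0, 1.0) / math.sqrt(chi2(nu) / nu)
    for _ in range(draws)
]
# sigma^2 | y ~ Inv-chi^2(n-1, s^2): nu * s^2 divided by a chi-square(nu) draw.
sigma_sq = [nu * s ** 2 / chi2(nu) for _ in range(draws)]

analytic_var_mu = (nu / (nu - 2)) * s ** 2 / n
analytic_mean_sigma_sq = s ** 2 * nu / (nu - 2)

print(mean(mu), variance(mu), analytic_var_mu)
print(mean(sigma_sq), analytic_mean_sigma_sq)
```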


Bayesian Conjugate Priors for the Normal 
Distribution 


1 Conjugate Prior 


Definition: 


Let X be a random variable having density $f(x \mid \theta)$ indexed by $\theta$.
Let $F = \{f(x \mid \theta);\ x \in \mathcal{X},\ \theta \in \Theta\}$ be a class of density functions.
A class $P = \{\pi(\theta);\ \theta \in \Theta\}$ of prior distributions is said to be a conjugate family
for F if the posterior distribution $\pi(\theta \mid x)$ is in the class P for all $x \in \mathcal{X}$ and for
all $\pi \in P$.


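Example:


Suppose $X \sim N(\mu, \sigma^2)$, $\mu \in \mathbb{R}$ and $\sigma^2$ is known,

\[
f(x \mid \mu) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2\sigma^2}(x-\mu)^2}
\]


Suppose $P = \{N(\alpha, \beta^2),\ \alpha \in \mathbb{R};\ \beta > 0\}$, i.e.

\[
\pi(\mu) = \frac{1}{\beta\sqrt{2\pi}}\, e^{-\frac{1}{2\beta^2}(\mu-\alpha)^2}, \quad \mu \in \mathbb{R}
\]


\[
\text{Then } \pi(\mu \mid x) \sim N\left(\frac{\beta^2 x + \sigma^2\alpha}{\sigma^2 + \beta^2},\ \frac{1}{\frac{1}{\sigma^2} + \frac{1}{\beta^2}}\right)
\]

i.e. $\pi(\mu \mid x) \in P$

Hence P is conjugate for F,


where $F = \{N(\mu, \sigma^2),\ \mu \in \mathbb{R},\ \sigma^2 > 0 \text{ is known}\}$


Note :-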


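If $X_1, X_2, \ldots, X_n \overset{iid}{\sim} N(\mu, \sigma^2)$, $\mu \in \mathbb{R}$ and $\sigma^2$ is known,


\[
\text{then } F = \left\{ f(x \mid \mu) = \frac{1}{\sigma^{n}(\sqrt{2\pi})^{n}}\, e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2},\ \mu \in \mathbb{R} \right\}
\]


If $P = \{N(\alpha, \beta^2),\ \alpha \in \mathbb{R};\ \beta > 0\}$,


\[
\text{then } \pi(\mu \mid x) \sim N\left(\frac{\frac{n\bar{x}}{\sigma^2} + \frac{\alpha}{\beta^2}}{\frac{n}{\sigma^2} + \frac{1}{\beta^2}},\ \frac{1}{\frac{n}{\sigma^2} + \frac{1}{\beta^2}}\right)
\]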


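Let $X_1, X_2, \ldots, X_n \overset{iid}{\sim} N(\mu, \sigma^2)$, both $\mu$ and $\sigma^2$ are unknown


Let us consider the following prior distributions


\[
\pi(\mu \mid \sigma^2) \propto (\sigma^2)^{-\frac{1}{2}}\, e^{-\frac{s_0}{2\sigma^2}(\mu-\delta)^2} \quad\longrightarrow\quad \text{Normal Prior}
\]


\[
\pi(\sigma^2) \propto (\sigma^2)^{-(\alpha+1)}\, e^{-\frac{\beta}{\sigma^2}} \quad\longrightarrow\quad \text{Inverse Gamma Prior}
\]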


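To find the Posterior Distribution


Posterior distribution of $(\mu, \sigma^2)$


\[
\pi(\mu, \sigma^2 \mid x) \propto L(\mu, \sigma^2 \mid x)\,\pi(\mu \mid \sigma^2)\,\pi(\sigma^2)
\]

\[
\propto \left(\frac{1}{\sigma^2}\right)^{\frac{n}{2}} e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2}
\left(\frac{1}{\sigma^2}\right)^{\frac{1}{2}} e^{-\frac{s_0}{2\sigma^2}(\mu-\delta)^2}
\left(\frac{1}{\sigma^2}\right)^{\alpha+1} e^{-\frac{\beta}{\sigma^2}}
\]

To find the Posterior Distribution of $\sigma^2$

\[
\text{Coeff of } \mu^2 = -\frac{n}{2\sigma^2} - \frac{s_0}{2\sigma^2} = -\left(\frac{n+s_0}{2\sigma^2}\right)
\]

\[
\text{Coeff of } \mu = \frac{2n\bar{x}}{2\sigma^2} + \frac{2s_0\delta}{2\sigma^2} = \left(\frac{n\bar{x}+s_0\delta}{\sigma^2}\right)
\]

Now,

\[
\pi(\mu, \sigma^2 \mid x) = \text{const}\left(\frac{1}{\sigma^2}\right)^{\frac{n}{2}+\alpha+1+\frac{1}{2}}
e^{-\frac{1}{\sigma^2}\left[\frac{1}{2}\sum x_i^2 + \beta + \frac{s_0\delta^2}{2}\right]}
e^{-\frac{n+s_0}{2\sigma^2}\left[\mu^2 - 2\mu\,\frac{n\bar{x}+s_0\delta}{n+s_0}\right]}
\]

As

\[
-\frac{n+s_0}{2\sigma^2}\left[\mu^2 - 2\mu\,\frac{n\bar{x}+s_0\delta}{n+s_0}\right]
= -\frac{n+s_0}{2\sigma^2}\left(\mu - \frac{n\bar{x}+s_0\delta}{n+s_0}\right)^2 + \frac{(n\bar{x}+s_0\delta)^2}{2\sigma^2(n+s_0)}
\]

We have

\[
\pi(\mu, \sigma^2 \mid x) = \text{const}\left(\frac{1}{\sigma^2}\right)^{\frac{n}{2}+\alpha+1+\frac{1}{2}}
e^{-\frac{1}{\sigma^2}\left[\frac{1}{2}\sum x_i^2 + \beta + \frac{s_0\delta^2}{2} - \frac{(n\bar{x}+s_0\delta)^2}{2(n+s_0)}\right]}
e^{-\frac{n+s_0}{2\sigma^2}\left(\mu - \frac{n\bar{x}+s_0\delta}{n+s_0}\right)^2}
\]


\[
\pi(\sigma^2 \mid x) = \int_{-\infty}^{\infty} \pi(\mu, \sigma^2 \mid x)\,d\mu
\]


\[
= \text{const}\left(\frac{1}{\sigma^2}\right)^{\frac{n}{2}+\alpha+1+\frac{1}{2}}
e^{-\frac{1}{\sigma^2}\left[\frac{1}{2}\sum x_i^2 + \beta + \frac{s_0\delta^2}{2} - \frac{(n\bar{x}+s_0\delta)^2}{2(n+s_0)}\right]}
\int_{-\infty}^{\infty} e^{-\frac{n+s_0}{2\sigma^2}\left(\mu - \frac{n\bar{x}+s_0\delta}{n+s_0}\right)^2} d\mu
\]


\[
\longrightarrow \text{Inverse Gamma}\left(\frac{n}{2}+\alpha,\ \frac{1}{2}\sum x_i^2 + \beta + \frac{s_0\delta^2}{2} - \frac{(n\bar{x}+s_0\delta)^2}{2(n+s_0)}\right)
\]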
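Posterior Distribution of $\mu$ given $\sigma^2$ 

$$\pi(\mu|\sigma^2, \mathbf{x}) = \frac{\pi(\mu, \sigma^2|\mathbf{x})}{\pi(\sigma^2|\mathbf{x})}$$ 

Thus we have 

$$\pi(\mu|\sigma^2, \mathbf{x}) = \frac{1}{\sqrt{2\pi}\,\frac{\sigma}{\sqrt{n+s_0}}}\, e^{-\frac{n+s_0}{2\sigma^2}\left(\mu - \frac{n\bar{x}+s_0\delta}{n+s_0}\right)^2}$$ 

i.e. 

$$\pi(\mu|\sigma^2, \mathbf{x}) \sim N\left(\frac{n\bar{x}+s_0\delta}{n+s_0},\ \frac{\sigma^2}{n+s_0}\right) \longrightarrow \text{Normal distribution}$$ 


Hence the classes of prior distributions $\{\pi(\mu|\sigma^2),\ \mu \in \mathbb{R}\}$ and $\{\pi(\sigma^2),\ \sigma^2 > 0\}$ as 
considered are conjugate. 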


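Also, 

$$\pi(\mu|\mathbf{x}) = \int_0^\infty \pi(\mu, \sigma^2|\mathbf{x})\, d\sigma^2 = \int_0^\infty \underbrace{\pi(\mu|\sigma^2, \mathbf{x})}_{\text{Normal}}\ \underbrace{\pi(\sigma^2|\mathbf{x})}_{\text{Inverse Gamma}}\, d\sigma^2$$ 

$\longrightarrow$ leads to a t distribution with df $= 2\alpha + n$ (twice the shape parameter of the 
Inverse Gamma posterior), shifted by a location parameter and multiplied by 
some scale parameter. 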


2 Posterior Predictive Distribution 


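If $X_1, X_2, \ldots, X_n \sim f(x|\theta)$, 
$\pi(\theta)$ $\longrightarrow$ prior distribution of $\theta$ 
$\pi(\theta|\mathbf{x})$ $\longrightarrow$ posterior distribution of $\theta$ given the data $\mathbf{x}$ 


The posterior predictive distribution of a new observation $x^*$ is given by 

$$p(x^*|\mathbf{x}) = \int_\Theta f(x^*|\theta)\, \pi(\theta|\mathbf{x})\, d\theta$$ 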


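Example: 


Suppose $X_1, X_2, \ldots, X_n \overset{iid}{\sim} N(\mu, \sigma^2)$, 

$\pi(\mu|\sigma^2) = N(\delta, \sigma^2/s_0)$, 

$\pi(\sigma^2) = IG(\alpha, \beta)$, 

both $\mu$ and $\sigma^2$ are unknown. 


Then, $\pi(\mu, \sigma^2|\mathbf{x}) = \pi(\mu|\sigma^2, \mathbf{x})\, \pi(\sigma^2|\mathbf{x})$ 

$$= N\left(\frac{n\bar{x}+s_0\delta}{n+s_0},\ \frac{\sigma^2}{n+s_0}\right)\ IG\left(\alpha + \frac{n}{2},\ \beta^*\right), \text{ where}$$ 

$$\beta^* = \frac{1}{2}\sum_{i=1}^n x_i^2 + \frac{1}{2}s_0\delta^2 + \beta - \frac{(n\bar{x}+s_0\delta)^2}{2(n+s_0)}$$ 

Let $\gamma = \frac{n\bar{x}+s_0\delta}{n+s_0}$ and $b^2 = \frac{\sigma^2}{n+s_0}$. 


The posterior predictive distribution of a new observation $x^*$ is 

$$p(x^*|\mathbf{x}) = \int_0^\infty \int_{-\infty}^{\infty} \underbrace{f(x^*|\mu, \sigma^2)}_{N(\mu, \sigma^2)}\ \underbrace{\pi(\mu|\sigma^2, \mathbf{x})}_{N(\gamma, b^2)}\ \underbrace{\pi(\sigma^2|\mathbf{x})}_{\text{Inverse Gamma}}\, d\mu\, d\sigma^2$$ 

$\longrightarrow$ This leads to a t-distribution with df $= 2\alpha + n$ (twice the shape parameter of 
the Inverse Gamma posterior), shifted by $\gamma$ and multiplied by some scale parameter. 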


2.1 Drawing samples from the Posterior Predictive Distribution 


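Method 1 : - 


We draw samples from a t-distribution with df $= 2\alpha + n$ multiplied by some 
scale parameter (which is a function of the hyper-parameters), by evaluating the 
exact analytic form of the distribution. 


Alternative Method : - 


Now, $p(x^*|\mathbf{x}) = \int_{-\infty}^{\infty} \int_0^\infty N(\mu, \sigma^2)\ N(\gamma, b^2)\ IG\left(\alpha + \frac{n}{2},\ \beta^*\right) d\sigma^2\, d\mu$ 


where $\gamma$, $b^2$, $\beta^*$ are as before. 


Method 


1) Draw $\sigma_j^2$ from $\pi(\sigma^2|\mathbf{x}) = IG\left(\alpha + \frac{n}{2},\ \beta^*\right)$ 
2) Draw $\mu_j$ from $\pi(\mu|\sigma_j^2, \mathbf{x}) = N\left(\gamma,\ \frac{\sigma_j^2}{n+s_0}\right)$ 
3) Draw $x_j^*$ from $N(\mu_j, \sigma_j^2)$ 


$\longrightarrow$ Repeat this to get the desired number of samples from the required 
t-distribution. 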


2.2 Highest Posterior Density Credible Set 


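$X$ is a random variable with density $f(x|\theta)$ indexed by $\theta$. 

$\pi(\theta)$ $\longrightarrow$ prior distribution of $\theta$ 

$\pi(\theta|\mathbf{x})$ $\longrightarrow$ posterior distribution of $\theta$ given the data $\mathbf{x}$ 

The $100(1-\alpha)\%$ HPD credible set for $\theta$ is the subset $C$ of $\Theta$, where 

$$C = \{\theta \in \Theta : \pi(\theta|\mathbf{x}) \geq k(\alpha)\},$$ 

where $k(\alpha)$ is the largest constant such that 

$$\int_C \pi(\theta|\mathbf{x})\, d\theta \geq 1 - \alpha$$ 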


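Example: 
Suppose $X_1, X_2, \ldots, X_n \overset{iid}{\sim} N(\mu, \sigma^2)$, both $\mu$ and $\sigma^2$ are unknown. 
We want to find the $100(1-\alpha)\%$ HPD credible set for $\mu$. 


As usual we have, 

$$\pi(\mu, \sigma^2|\mathbf{x}) = \pi(\mu|\sigma^2, \mathbf{x})\, \pi(\sigma^2|\mathbf{x}) = N(\gamma, b^2)\ IG\left(\alpha + \frac{n}{2},\ \beta^*\right), \text{ where}$$ 

$$\beta^* = \frac{1}{2}\sum_{i=1}^n x_i^2 + \frac{1}{2}s_0\delta^2 + \beta - \frac{(n\bar{x}+s_0\delta)^2}{2(n+s_0)},$$ 

$$\gamma = \frac{n\bar{x}+s_0\delta}{n+s_0} \quad \text{and} \quad b^2 = \frac{\sigma^2}{n+s_0}$$ 

$$\pi(\mu|\mathbf{x}) = \int_0^\infty \pi(\mu, \sigma^2|\mathbf{x})\, d\sigma^2$$ 

$\longrightarrow$ t-distribution with df $= 2\alpha + n$ 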


Suppose we have the data: 
x <- c(1.64, 1.70, 1.72, 1.74, 1.82, 1.82, 1.90, 2.08) 


shapiro.test(x)   ## Testing for normality of the data 


## ################################################## 
##         Shapiro-Wilk normality test 
## 
## data:  x 
## W = 0.91699, p-value = 0.4059 
## ################################################## 


x.bar=mean(x) 
n=length(x) 


##Hyper-parameters 
delta=0.14; s.0=1.25; alpha=2.3; beta=1.78 


##Parameters of the posterior distributions 
gamma=((n*x.bar)+(s.0*delta))/(n+s.0) 
b.2=var(x)/(n+s.0)   ## b^2 = sigma^2/(n+s.0) with var(x) plugged in for sigma^2; not used below 
beta.star=0.5*sum(x^2)+0.5*s.0*delta^2+beta-(((n*x.bar)+(s.0*delta))^2)/(2*(n+s.0)) 


M=10000 
install.packages("invgamma") ##For drawing samples from inverse gamma distribution 
library(invgamma) 


draw=numeric(M) 


for(i in 1:M) 
{ 
  sigma.j=rinvgamma(n=1,shape=alpha+(n/2),rate=beta.star)   ## sigma_j^2 from IG(alpha+n/2, beta.star) 
  mu.j=rnorm(n=1,mean=gamma,sd=sqrt(sigma.j/(n+s.0)))       ## mu_j given sigma_j^2 
  draw[i]=rnorm(n=1,mean=mu.j,sd=sqrt(sigma.j))             ## x_j* given mu_j, sigma_j^2 
} 


k.alpha=quantile(draw,0.05) 
k.alpha 


      5% 
1.421776 


So we have obtained the $k(\alpha)$ value for the HPD credible set and hence the 
HPD credible set for the problem is 

$$C = \{\mu \in \mathbb{R} : \pi(\mu|\mathbf{x}) \geq 1.421776\}$$ 
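
As a complement, one common way to approximate an HPD interval directly from simulated draws of $\mu$ is to take the shortest interval that contains the required fraction of the draws. The sketch below is not the procedure used in the notes. It assumes a unimodal posterior, uses base R only (the Inverse Gamma draw is obtained as the reciprocal of a Gamma draw), and the helper name `hpd_interval` is hypothetical.

```r
# Shortest interval containing a fraction `level` of the draws
# (an empirical HPD interval when the posterior is unimodal).
hpd_interval <- function(draws, level = 0.95) {
  s <- sort(draws)
  m <- length(s)
  k <- ceiling(level * m)                   # number of draws the interval must cover
  widths <- s[k:m] - s[1:(m - k + 1)]       # widths of all windows of k consecutive draws
  i <- which.min(widths)
  c(lower = s[i], upper = s[i + k - 1])
}

set.seed(1)
# Data and hyper-parameters as given in the example above
x <- c(1.64, 1.70, 1.72, 1.74, 1.82, 1.82, 1.90, 2.08)
n <- length(x); delta <- 0.14; s0 <- 1.25; alpha <- 2.3; beta <- 1.78
gam <- (n * mean(x) + s0 * delta) / (n + s0)
beta_star <- 0.5 * sum(x^2) + 0.5 * s0 * delta^2 + beta -
  (n * mean(x) + s0 * delta)^2 / (2 * (n + s0))

# Composition sampling: sigma^2 ~ IG(alpha + n/2, beta_star), then mu | sigma^2
sigma2 <- 1 / rgamma(10000, shape = alpha + n / 2, rate = beta_star)
mu <- rnorm(10000, mean = gam, sd = sqrt(sigma2 / (n + s0)))

hpd <- hpd_interval(mu, 0.95)
print(hpd)
```

Because the marginal posterior of $\mu$ is symmetric about $\gamma$, the interval obtained this way should be roughly centred at $\gamma$.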




A Gateway to all Postgraduate Courses An MHRD project under its National Mission on Education through ICT (NME-ICT) 


Subject: Statistics 


Paper: Statistical Inference 
Module: Bayesian Analysis of Multiparameter Models 


Development Team 


Principal investigator: Dr. Bhaswati Ganguli, Professor, 
Department of Statistics, University of Calcutta 

Paper co-ordinator: Dr. Dipak K Dey, Associate Dean and BOT 
Distinguished Professor, Department of Statistics, 
University of Connecticut 

Content writer: Dr. Sourish Das, Assistant Professor, Chennai 
Mathematical Institute 


Content reviewer: Department of Statistics, University of Calcutta 


Introduction 


Multinomial Model for Categorical Data 


The multinomial sampling distribution is used to describe data for 
which each observation is one of k possible outcomes. 


> If y is the vector of counts of number of observations of each 
outcome, then 


$$p(y|\theta) \propto \prod_{j=1}^{k} \theta_j^{y_j}$$ 


where $\sum_{j=1}^{k} \theta_j = 1$ and $\sum_{j=1}^{k} y_j = n$. 


Conjugate Prior for Multinomial Distribution 


> The conjugate prior distribution is a multivariate 
generalization of the beta distribution known as the Dirichlet 
distribution, 

$$p(\theta|\alpha) \propto \prod_{j=1}^{k} \theta_j^{\alpha_j - 1}$$ 

where $0 < \theta_j < 1\ \forall j$ with $\sum_{j=1}^{k} \theta_j = 1$. 


> The resulting posterior distribution for the $\theta_j$'s is a Dirichlet 
with parameters $\alpha_j + y_j$. 


> The prior distribution is mathematically equivalent to a 
likelihood resulting from $\sum_{j=1}^{k} \alpha_j$ observations, with $\alpha_j$ 
observations of the $j^{th}$ outcome category. 


> As in the binomial there are several plausible noninformative 
Dirichlet prior distributions. 


> A uniform density is obtained by setting $\alpha_j = 1$ for all $j$; this 
distribution assigns equal density to any vector $\theta$ satisfying 
$\sum_{j=1}^{k} \theta_j = 1$. 


> Setting $\alpha_j = 0$ for all $j$ results in an improper prior distribution 
that is uniform in the $\log(\theta_j)$'s. 


> The resulting posterior distribution is proper if there is at least 
one observation in each of the $k$ categories, so that each 
component of $y$ is positive. 

Application Multinomial Distribution 


Pre-election Polling 


For a simple example of a multinomial model, we consider a sample 
survey question with four possible responses. In late January 2014, 
a survey of 18591 adults was conducted by CSDA to find out their 
preferences in the upcoming general election of India. Out of 18591 
persons, $y_1 = 5205$ supported UPA, $y_2 = 6507$ supported NDA, 
$y_3 = 929$ supported Left and $y_4 = 5950$ supported others. 


Data: 

         Support 
UPA    | y_1 = 5205 
NDA    | y_2 = 6507 
Left   | y_3 = 929 
Others | y_4 = 5950 
Total  | n = 18591 


> With a noninformative uniform prior distribution on $\theta$, 
$\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 1$. 


> The posterior distribution for $(\theta_1, \theta_2, \theta_3, \theta_4)$ is 
Dirichlet(5206, 6508, 930, 5951). 


> We could compute the posterior distribution of $\theta_1 - \theta_2$ by 
integration, but it is simpler just to draw 1000 points 
$(\theta_1, \theta_2, \theta_3, \theta_4)$ from the posterior Dirichlet distribution and then 
compute $\theta_1 - \theta_2$ for each. 
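
The simulation described above can be sketched in base R as follows. Base R has no Dirichlet sampler, so the sketch uses the standard construction of normalising independent Gamma variates; the helper name `rdirichlet_sketch` is hypothetical, and the seed is arbitrary.

```r
# Draw from Dirichlet(a) by normalising independent Gamma(a_j, 1) variates.
# (rdirichlet_sketch is a hypothetical helper name, used only for this sketch.)
rdirichlet_sketch <- function(n_draws, a) {
  g <- matrix(rgamma(n_draws * length(a), shape = rep(a, each = n_draws)),
              nrow = n_draws)               # column j holds Gamma(a_j, 1) draws
  g / rowSums(g)                            # each row now sums to 1
}

set.seed(1)
y <- c(UPA = 5205, NDA = 6507, Left = 929, Others = 5950)   # survey counts from the text
alpha_prior <- rep(1, 4)                                     # uniform Dirichlet prior
theta <- rdirichlet_sketch(1000, alpha_prior + y)            # posterior draws
colnames(theta) <- names(y)

diff12 <- theta[, "UPA"] - theta[, "NDA"]                    # theta1 - theta2 for each draw
print(round(colMeans(theta), 4))
print(round(quantile(diff12, c(0.025, 0.975)), 4))
print(mean(diff12 < 0))                                      # share of draws with NDA ahead of UPA
```

Summaries of `theta` and `diff12` (means, quantiles, the share of negative differences) then approximate the corresponding posterior quantities without any integration.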


[Figure: distribution of the simulated values of theta1-theta2; horizontal axis theta1-theta2, from -0.09 to -0.05] 


          UPA    NDA   Left Others 
       0.2800 0.3500 0.0501 0.3200 


          UPA    NDA   Left Others 
2.5%   0.2736 0.3431 0.0469 0.3135 
97.5%  0.2865 0.3566 0.0533 0.3266 


theta 1 - theta 2: 
[1] -0.07 


   2.5%   97.5% 
-0.0811 -0.0582 


> The estimated posterior probability that NDA had more support 
than UPA in the survey population is 99%. 


> In complicated problems (for example, analyzing the results of 
many survey questions simultaneously) the number of 
multinomial categories, and thus parameters, becomes so large 
that it is hard to usefully analyze a dataset of moderate size 
without additional structure in the model. 


> Formally, additional information can enter the analysis 
through the prior distribution or the sampling model. 


Multivariate Normal 
with unknown Mean and Covariance 


Conjugate inverse-Wishart family of prior distributions 


> Multivariate normal likelihood 

$$y|\mu, \Sigma \sim N(\mu, \Sigma)$$ 

where $\mu$ is a $d$-dimensional column vector and $\Sigma$ is a $d \times d$ 
symmetric positive-definite covariance matrix. 


> The likelihood function for a single observation is 

$$p(y|\mu, \Sigma) \propto |\Sigma|^{-1/2} \exp\left(-\frac{1}{2}(y-\mu)^T \Sigma^{-1} (y-\mu)\right)$$ 


> The likelihood function for $n$ independent and identically 
distributed observations is 

$$p(y_1, \ldots, y_n|\mu, \Sigma) \propto |\Sigma|^{-n/2} \exp\left(-\frac{1}{2}\sum_{i=1}^{n}(y_i-\mu)^T \Sigma^{-1} (y_i-\mu)\right) = |\Sigma|^{-n/2} \exp\left(-\frac{1}{2}\,\mathrm{tr}(\Sigma^{-1} S_0)\right)$$ 

where $S_0 = \sum_{i=1}^{n}(y_i-\mu)(y_i-\mu)^T$ is the matrix of 'sum of squares' relative to $\mu$. 


> $\Sigma \sim \text{Inv-Wishart}_{\nu_0}(\Lambda_0^{-1})$ 


> $\mu|\Sigma \sim N(\mu_0, \Sigma/\kappa_0)$ 


> The joint prior density 

$$p(\mu, \Sigma) \propto |\Sigma|^{-\left(\frac{\nu_0+d}{2}+1\right)} \exp\left(-\frac{1}{2}\,\mathrm{tr}(\Lambda_0 \Sigma^{-1}) - \frac{\kappa_0}{2}(\mu-\mu_0)^T \Sigma^{-1} (\mu-\mu_0)\right)$$ 


> The parameters $\nu_0$ and $\Lambda_0$ describe the degrees of freedom and 
the scale matrix for the inverse-Wishart distribution on $\Sigma$. 


> The remaining parameters are the prior mean, $\mu_0$, and the 
number of prior measurements, $\kappa_0$, on the $\Sigma$ scale. 


Conjugate inverse-Wishart family of posterior distributions 


> $\Sigma|y_1, \ldots, y_n \sim \text{Inv-Wishart}_{\nu_n}(\Lambda_n^{-1})$ 


> $\mu|\Sigma, y_1, \ldots, y_n \sim N(\mu_n, \Sigma/\kappa_n)$ 


> $\mu_n = \frac{\kappa_0}{\kappa_0+n}\mu_0 + \frac{n}{\kappa_0+n}\bar{y}$ 


> $\kappa_n = \kappa_0 + n$ 


> $\nu_n = \nu_0 + n$ 


> $\Lambda_n = \Lambda_0 + S + \frac{\kappa_0 n}{\kappa_0+n}(\bar{y}-\mu_0)(\bar{y}-\mu_0)^T$ 
where $S = \sum_{i=1}^{n}(y_i-\bar{y})(y_i-\bar{y})^T$ 
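
The posterior update formulas above translate directly into a few lines of R. The sketch below is illustrative only: the function name `niw_update`, the toy data matrix and the prior settings are hypothetical, chosen so the arithmetic is easy to check by hand.

```r
# Posterior hyper-parameters of the normal / inverse-Wishart model.
# y: n x d data matrix (one observation per row).
# (niw_update is a hypothetical helper name, used only for this sketch.)
niw_update <- function(y, mu0, kappa0, nu0, Lambda0) {
  n <- nrow(y)
  ybar <- colMeans(y)
  centred <- sweep(y, 2, ybar)              # subtract the column means
  S <- crossprod(centred)                   # sum of (y_i - ybar)(y_i - ybar)^T
  kappa_n <- kappa0 + n
  list(
    mu_n     = (kappa0 * mu0 + n * ybar) / kappa_n,
    kappa_n  = kappa_n,
    nu_n     = nu0 + n,
    Lambda_n = Lambda0 + S + (kappa0 * n / kappa_n) * tcrossprod(ybar - mu0)
  )
}

# Hypothetical data (4 observations, d = 2) and prior settings
y <- matrix(c(1, 2,
              2, 1,
              3, 4,
              4, 3), ncol = 2, byrow = TRUE)
post <- niw_update(y, mu0 = c(0, 0), kappa0 = 1, nu0 = 4, Lambda0 = diag(2))
print(post)
```

With $\kappa_0 = 1$ the prior mean carries the weight of a single observation, so the posterior mean is pulled only slightly from $\bar{y}$ towards $\mu_0$.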


Analysis of a Bioassay Experiment 


> In the development of drugs and other chemical compounds, 
acute toxicity tests or bioassay experiments are commonly 
performed on animals. 


» Such experiments proceed by administering various dose levels 
of the compound to batches of animals. 


Dose x_i       Number of       Number of 
(log g/ml)     animals, n_i    deaths, y_i 

-0.86          5               0 
-0.30          5               1 
-0.05          5               3 
 0.73          5               5 

Modeling the dose-response relation 


> $y_i \sim \text{Bin}(n_i, \theta_i)$ 

where $\theta_i$ is the probability of death for animals given dose $x_i$. 


> The relationship can be modeled as 

$$\text{logit}(\theta_i) = \log\left(\frac{\theta_i}{1-\theta_i}\right) = \alpha + \beta x_i$$ 

this is called the logistic regression model. 


> It can also be represented as 

$$\theta_i = \frac{\exp\{\alpha + \beta x_i\}}{1 + \exp\{\alpha + \beta x_i\}} = \text{logit}^{-1}(\alpha + \beta x_i)$$ 


> The likelihood function is 

$$p(\alpha, \beta|y, n, x) \propto \prod_{i=1}^{k} \left[\text{logit}^{-1}(\alpha + \beta x_i)\right]^{y_i} \left[1 - \text{logit}^{-1}(\alpha + \beta x_i)\right]^{n_i - y_i}$$ 


> The prior is $(\alpha, \beta) \sim N(0, I_2)$ 
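
One simple way to simulate from this two-parameter posterior is to evaluate it on a grid and resample grid points. The sketch below uses the bioassay table and the $N(0, I_2)$ prior stated above; the grid ranges, grid sizes and number of draws are assumptions of the sketch, and the notes do not say which sampler produced the plots that follow.

```r
# Bioassay data from the table above
x <- c(-0.86, -0.30, -0.05, 0.73)   # log dose
n <- c(5, 5, 5, 5)                  # animals per batch
y <- c(0, 1, 3, 5)                  # deaths per batch

# Unnormalised log posterior: binomial-logit likelihood times N(0, I_2) prior
log_post <- function(alpha, beta) {
  eta <- alpha + beta * x
  loglik <- sum(dbinom(y, size = n, prob = plogis(eta), log = TRUE))
  logprior <- dnorm(alpha, 0, 1, log = TRUE) + dnorm(beta, 0, 1, log = TRUE)
  loglik + logprior
}

# Grid limits are an assumption, chosen wide enough to cover the posterior mass
alpha_grid <- seq(-4, 4, length.out = 161)
beta_grid  <- seq(-3, 6, length.out = 181)
lp <- outer(alpha_grid, beta_grid, Vectorize(log_post))
post <- exp(lp - max(lp))           # subtract the maximum for numerical stability
post <- post / sum(post)            # normalise over the grid

# Sample grid cells in proportion to their posterior mass
set.seed(1)
idx <- sample(length(post), size = 2000, replace = TRUE, prob = as.vector(post))
alpha_draw <- alpha_grid[(idx - 1) %% length(alpha_grid) + 1]    # row index -> alpha
beta_draw  <- beta_grid[(idx - 1) %/% length(alpha_grid) + 1]    # column index -> beta
print(c(alpha = mean(alpha_draw), beta = mean(beta_draw)))
```

The grid approach is practical here only because there are two parameters; for more parameters a Markov chain Monte Carlo method would usually be preferred.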


Simulated samples from posterior distribution 


[Figure: simulated posterior draws of (alpha, beta); vertical axis beta] 


[Figure: trace plots and kernel density estimates for alpha (N = 12500, bandwidth = 0.12) and beta (N = 12500, bandwidth = 0.14)] 




Subject: Statistics 


Paper: Statistical Inference 


Module: Introduction to Hierarchical Models 




Introduction 


> When there are few parameters, posterior inference in 
nonconjugate multiparameter models can be obtained by 
simulation methods. 


> Sophisticated models can often be represented in a hierarchical 
form, for which effective computational strategies are available. 


General Approach to Bayesian Modeling 


1 Write the likelihood part of the model, $p(y|\theta)$, ignoring any 
factors that are free of $\theta$. 


2 Write the posterior density, $p(\theta|y) \propto p(\theta)\,p(y|\theta)$. If prior 
information is well-formulated, include it in $p(\theta)$. Otherwise 
use a non-informative prior. 


3 Create a crude estimate of the parameters, $\theta$, for use as a 
starting point and a comparison to the computation in the 
next step. 


4 Draw simulations $\theta^1, \ldots, \theta^S$ from the posterior 
distribution. Use the sample draws to compute the posterior 
density of any functions of $\theta$ that may be of interest. For 
non-conjugate models this step could be difficult. 


5 If any predictive quantities, $\tilde{y}$, are of interest, simulate 
$\tilde{y}^1, \ldots, \tilde{y}^S$ by drawing each $\tilde{y}^s$ from the sampling distribution 
conditional on the drawn value $\theta^s$, $p(\tilde{y}|\theta^s)$. 


> Various methods (such as Markov Chain Monte Carlo) have 
been developed to draw posterior simulations in complicated 
models. 


> If $\theta$ has only one or two components, it is possible to draw 
simulations by computing on a grid. 
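
The five steps above can be sketched for a one-parameter model computed on a grid. Everything in this example is hypothetical: the binomial model, the data (7 successes in 20 trials) and the flat prior are chosen only to make each step concrete.

```r
# Hypothetical data: 7 successes in 20 trials (illustration only)
y <- 7; n <- 20
theta_grid <- seq(0.001, 0.999, length.out = 999)

log_lik   <- dbinom(y, n, theta_grid, log = TRUE)    # step 1: likelihood p(y | theta)
log_prior <- dunif(theta_grid, log = TRUE)           # step 2: non-informative (flat) prior
log_post  <- log_lik + log_prior
post <- exp(log_post - max(log_post))
post <- post / sum(post)                             # posterior mass on the grid

theta_start <- y / n                                 # step 3: crude estimate for comparison

set.seed(1)
S <- 5000
theta_draw <- sample(theta_grid, S, replace = TRUE, prob = post)   # step 4: posterior draws
y_tilde <- rbinom(S, size = n, prob = theta_draw)                  # step 5: predictive draws

print(c(crude = theta_start, post_mean = mean(theta_draw)))
```

In this conjugate case the exact posterior is Beta(8, 14), so the simulated posterior mean can be compared with 8/22 as a check on the grid computation.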


Hierarchical Models 


> Hierarchical data is ubiquitous in the social sciences where 
measurement occurs at different levels of aggregation. 


> e.g. we collect measurements of individuals who live in a 
certain locality or belong to a particular race or social group. 


> When this occurs, standard techniques either assume that 
these groups belong to entirely different populations or ignore 
the aggregate information entirely. 


> Hierarchical models provide a way of pooling the information 
for the disparate groups without assuming that they belong to 
precisely the same population. 


> Suppose we have collected data about some random variable 
$Y$ from $m$ different populations with $n$ observations for each 
population. 


> Let $Y_{ij}$ represent observation $j$ from population $i$. 


> Suppose $Y_{ij} \sim f(\theta_i)$ where $\theta_i$ is a vector of parameters of 
population $i$. 


> Further $\theta_i \sim f(\Theta)$ where $\Theta$ may be a vector. 


> Note, until this point this is just a standard Bayesian setup 
where we are assigning some prior distribution for the 
parameters $\theta$ that govern the distribution of $y$. 


> Now we extend the model, and assume that the parameters 
$\Theta_{11}, \Theta_{12}$ that govern the distribution of the $\theta$'s are 
themselves random variables and assign a prior distribution to 
these variables as well: 

$$\Theta \sim f(\alpha, \beta)$$ 


> This distribution on $\Theta$ is called the hyperprior. The parameters of the 
hyperprior (here $\alpha$, $\beta$) may be "known" and represent our prior beliefs 
about $\Theta$. 


> In theory, we can assign a probability distribution for these 
quantities as well, and proceed to another layer of hierarchy. 


[Diagram: three-level hierarchy with the hyperparameters (alpha, beta) at the top, the group parameters theta_1, theta_2, theta_3, ... in the middle, and the observations y_1, y_2, y_3, ... at the bottom] 


Exchangeability 


Exchangeability : Formal 


The parameters $\theta_1, \theta_2, \ldots, \theta_n$ are exchangeable in their joint 
distribution if $p(\theta_1, \theta_2, \ldots, \theta_n)$ is invariant to permutations in the 
index $1, 2, \ldots, n$. 


Exchangeability : Informal 


If no information other than the data is available to distinguish any 
of the $\theta_j$'s from any of the others, and no ordering of the 
parameters can be made, one must assume symmetry among the 
parameters in the prior distribution. 


This concept is closely related to the concept of independent and 
identically distributed random variables where, conditional on the data, each 
observation is treated the same. 


Application: Poisson-Gamma Model 


> Robert et al. (2004) present data on the number of failures 
($y_i$) for each of 10 pumps in a nuclear plant. 


> We also have the times ($t_i$) over which each pump was observed. 


> To model this process, we assume that the number of failures 
follows a Poisson distribution. 

$$\text{failure}_i \sim \text{Poisson}(\lambda_i t_i)$$ 


Q1 How would we address this question? 


Q2 Why might we model this as a hierarchical process? 


Exchangeability 


> Exchangeability means that we can treat the parameters for 
each sub-population as exchangeable units. 


> In its simplest form, each parameter $\theta_i$ is treated as an 
independent sample from a distribution governed by unknown 
parameter vector $\Theta$. 

$$p(\theta_1, \theta_2, \ldots, \theta_n|\Theta) = \prod_{i=1}^{n} p(\theta_i|\Theta)$$ 


> In a more general form, we may also condition on data that 
we have about the different sub-populations. 


> We can write the joint prior distribution as: 

$$p(\theta_1, \theta_2, \ldots, \theta_n, \Theta) = p(\theta_1, \theta_2, \ldots, \theta_n|\Theta)\, p(\Theta)$$ 


> By Bayes' Rule 

$$p(\theta_1, \ldots, \theta_n, \Theta|Y) \propto \text{prior} \times \text{likelihood for } Y$$ 


Application: Poisson-Gamma Model 


> We consider the data model to be: 
$y_i \sim \text{Poisson}(\lambda_i t_i)$ for $i = 1, \ldots, 10$. 


> To model this as a hierarchical process, we assume that the 
failure rates $\lambda_i$ are exchangeable draws from a common 
distribution. 


> In this case, the gamma distribution has desirable properties. 
$\lambda_i \sim \text{Gamma}(\alpha, \beta)$ for $i = 1, \ldots, 10$. 


Note that $\alpha = 1.8$ is fixed and $\beta$ is an unknown parameter. 


> To satisfy the requirement of exchangeability, what must we 
assume about the data generating process? 


> Finally, to complete the hierarchical structure, we must assign 
a "hyperprior" for the parameter $\beta$. Again, the gamma 
distribution has nice properties, so we assume that: 

$$\beta \sim \text{Gamma}(\nu, \delta)$$ 


> We assign Gamma($\nu$, $\delta$) on the unknown $\beta$ with $\nu = 0.001$ and $\delta = 1$. 


Application: Poisson-Gamma Model 


> The joint posterior distribution is: 


PAi Bly, t) « [ [ PoisQutilu) xGamma(Xo, 8)Gamma(B|v., 6) 


16 / 20 


. Ak athshala RD 
Application: Poisson-Gamma Model JUTOSITeIT aay 


> The joint posterior distribution is: 


P(r; Bly, t) x [ [ Pois(Aitily:)x Gamma(àila, 8)Gamma(B|v, 6) 


> Using our trick for conditional distributions, we know 


p(Ai|B, y, t) ~ Gamma(yi + a, ti + B) 


16 / 20 


athshala (i HRD 
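Application: Poisson-Gamma Model

> The joint posterior distribution is:

$$p(\lambda, \beta \mid y, t) \propto \prod_{i=1}^{10} \text{Pois}(y_i \mid \lambda_i t_i) \times \text{Gamma}(\lambda_i \mid \alpha, \beta) \times \text{Gamma}(\beta \mid \nu, \delta)$$

> Using our trick for conditional distributions, we know

$\lambda_i \mid \beta, y, t \sim \text{Gamma}(y_i + \alpha, t_i + \beta)$

> $\beta \mid \lambda, y, t \sim \text{Gamma}(10\alpha + \nu, \delta + \sum_{i=1}^{10} \lambda_i)$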
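The two full conditionals above suggest a Gibbs sampler. The following is a minimal sketch, not the author's implementation: the counts `y.obs` and exposure times `t.obs` are simulated stand-ins (the slides do not list the data), and the starting value, number of iterations and burn-in length are arbitrary assumptions. The second Gamma argument is treated as a rate.

```r
# Hypothetical data: counts y.obs and exposure times t.obs for 10 units.
# These are simulated placeholders, not the data behind the slides.
set.seed(1)
n.units <- 10
t.obs <- runif(n.units, 1, 100)
lambda.true <- rgamma(n.units, shape = 1.8, rate = 2)
y.obs <- rpois(n.units, lambda.true * t.obs)

alpha <- 1.8; nu <- 0.001; delta <- 1   # values stated on the slides
n.iter <- 5000                          # assumed chain length
lambda.draws <- matrix(NA_real_, n.iter, n.units)
beta.draws <- numeric(n.iter)
beta <- 1                               # arbitrary starting value

for (s in 1:n.iter) {
  # lambda_i | beta, y, t ~ Gamma(y_i + alpha, t_i + beta)
  lambda <- rgamma(n.units, shape = y.obs + alpha, rate = t.obs + beta)
  # beta | lambda, y, t ~ Gamma(10 * alpha + nu, delta + sum(lambda))
  beta <- rgamma(1, shape = n.units * alpha + nu, rate = delta + sum(lambda))
  lambda.draws[s, ] <- lambda
  beta.draws[s] <- beta
}

burn <- 1:1000                          # assumed burn-in
post.mean.lambda <- colMeans(lambda.draws[-burn, ])
post.mean.beta <- mean(beta.draws[-burn])
```

Posterior means are then just column averages of the retained draws, which is the Monte Carlo integration idea discussed in the next module.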


Application: Poisson-Gamma Model


Posterior mean of lambda's : 


[1] 0.07063148 0.15167434 0.10484798 0.12280676 0.65433707 
[7] 0.84803853 0.86260141 1.36635800 1.92780912 


Posterior mean of beta : 


[1] 2.393286 


[Figure: Trace plot of lambda1; x-axis: Time (0 to 8000), y-axis ticks from 0.05 to 0.20.]


Application: Poisson-Gamma Model

[Figure: Posterior density of lambda1; x-axis: lambda (0.00 to 0.20).]


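Application: Poisson-Gamma Model

> As we assume $\lambda_i \sim \text{Gamma}(\alpha = 1.8, \beta)$ and estimated $\hat{\beta} = 2.39$

> 95% CI of Gamma($\alpha = 1.8$, $\hat{\beta} = 2.39$)
[1] 0.07631031 2.17514481

> It indicates $\lambda_1$ might be an outlier: its posterior mean (about 0.0706) falls below the lower limit of this interval.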
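An interval of this kind can be obtained from the Gamma quantile function. A small sketch, assuming the interval is the central 95% interval and that $\hat{\beta}$ enters as a rate parameter (an assumption that appears consistent with the limits printed above):

```r
# Central 95% interval of Gamma(shape = 1.8, rate = 2.39).
# Treating beta.hat as a rate (not a scale) is an assumption.
alpha <- 1.8
beta.hat <- 2.39
ci <- qgamma(c(0.025, 0.975), shape = alpha, rate = beta.hat)
ci
```

Any $\lambda_i$ whose posterior mean lies outside `ci` is a candidate outlier under the fitted Gamma population distribution.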




A Gateway to all Postgraduate Courses An MHRD project under its National Mission on Education through ICT (NME-ICT) 


Subject: Statistics 


Paper: Statistical Inference 
Module: Monte Carlo Integration and Simulation Technique - Part 1

Development Team


Principal investigator: Dr. Bhaswati Ganguli, Professor, 
Department of Statistics, University of Calcutta 

Paper co-ordinator: Dr. Dipak K Dey, Associate Dean and BOT 
Distinguished Professor, Department of Statistics, 
University of Connecticut 

Content writer: Dr. Sourish Das, Assistant Professor, Chennai 
Mathematical Institute 


Content reviewer: Department of Statistics, University of Calcutta 


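Introduction

> Suppose $Y \sim p(y \mid \theta)$ and $p(\theta)$ is the prior distribution over $\theta$

> Our objective is to make statistical inference about $\theta$

> The posterior distribution is:

$$p(\theta \mid y) = \frac{p(y \mid \theta) p(\theta)}{\int_{\Theta} p(y \mid \theta) p(\theta) \, d\theta}$$

> Objective is to estimate the posterior mean:

$$E(\theta \mid y) = \int_{\Theta} \theta \, p(\theta \mid y) \, d\theta = g(y)$$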

Introduction

> In order to get the posterior mean we have to solve this integral.

> An analytical solution does not exist for many sophisticated models.

> So we have to resort to simulation techniques to solve this integration problem.

> Typically this is known as the Monte Carlo integration method.


Monte Carlo Methods

> Monte Carlo methods rely on
  > The possibility of generating an endless flow of random variables
  > For well-known or new distributions.

> Such a simulation is based on the generation of uniform random variables on the interval (0, 1).

> We are not concerned with the details of producing uniform random numbers.

> We assume the existence of such a sequence.


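Monte Carlo Integration

> As we want to estimate the posterior mean:

$$E(\theta \mid y) = \int_{\Theta} \theta \, p(\theta \mid y) \, d\theta = g(y)$$

we can do so by simulating random samples from $p(\theta \mid y)$

> Suppose $(\theta^1, \ldots, \theta^N)$ are random samples from $p(\theta \mid y)$, then we can approximate $g(y)$ by

$$\hat{g}(y) = \frac{1}{N} \sum_{i=1}^{N} \theta^i$$

> If we can ensure a simple random sample then the SLLN ensures

$$\hat{g}(y) = \frac{1}{N} \sum_{i=1}^{N} \theta^i \to g(y) = E(\theta \mid y)$$

as $N \to \infty$, where $N$ is the simulation size.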
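As an illustration of the sample-average approximation, here is a short sketch under an assumed posterior. The Beta(3, 9) posterior is hypothetical and chosen only because its mean, 3/12 = 0.25, is known exactly and can be compared with the Monte Carlo estimate.

```r
# Hypothetical posterior: theta | y ~ Beta(3, 9), with exact mean 0.25.
set.seed(42)
N <- 100000                      # simulation size
theta <- rbeta(N, shape1 = 3, shape2 = 9)
g.hat <- mean(theta)             # Monte Carlo estimate of E(theta | y)
g.hat
```

Increasing `N` shrinks the gap between `g.hat` and the exact mean, in line with the SLLN statement above.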


Using the R Generator

> R has a large number of functions that will generate random samples from standard distributions.

> If a built-in function in R is available then we can use it directly.

> However, if built-in functions are not available and the posterior distribution is known only up to its kernel, then we have to use generic methods to draw samples from such distributions.


The Inverse Transform

> The probability integral transform allows us to transform a uniform into any random variable.

> If X has a density f and cdf F, then we have the relation

$$F(x) = \int_{-\infty}^{x} f(t) \, dt$$

we know $F(X) \sim \text{Unif}(0, 1)$, so we set $U = F(x)$ and solve for $x$

> If $X \sim \text{Exp}(1)$ then $F(x) = 1 - e^{-x}$

> Solving for $x$ in $u = 1 - e^{-x}$ gives $x = -\log(1 - u)$
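The Exp(1) case can be sketched in a few lines of R. The sample size and seed are arbitrary choices; `rexp` is used only as a reference to compare against.

```r
set.seed(123)
n <- 10000
u <- runif(n)             # uniform draws on (0, 1)
x <- -log(1 - u)          # inverse-transform draws from Exp(1)
y <- rexp(n, rate = 1)    # built-in generator, for comparison
c(mean(x), mean(y))       # both should be close to the true mean 1
```

Histograms of `x` and `y` (for example with `hist(x)` and `hist(y)`) should look alike, which is what the comparison figure on the next slide shows.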


Generating Exponentially Distributed Random Samples

[Figure: two histograms side by side, "Exp from Uniforms" (x-axis: x) and "Exp from Built-in R" (x-axis: y).]


The Inverse Transform from Uniforms

> This method is useful for other probability distributions; ones obtained as a transformation of uniform random variables

> Logistic cdf:

$$F(x) = \frac{1}{1 + \exp\left(-\frac{x - \mu}{\beta}\right)}$$

> Cauchy cdf:

$$F(x) = \frac{1}{2} + \frac{1}{\pi} \arctan\left(\frac{x - \mu}{\sigma}\right)$$
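Inverting the two cdfs above gives $x = \mu + \beta \log(u / (1 - u))$ for the logistic and $x = \mu + \sigma \tan(\pi (u - 1/2))$ for the Cauchy. A minimal sketch follows; the function names `q.logistic` and `q.cauchy` and the parameter values are illustrative choices, not part of the slides.

```r
# Inverse cdfs obtained by solving u = F(x) for x.
q.logistic <- function(u, mu = 0, beta = 1) mu + beta * log(u / (1 - u))
q.cauchy <- function(u, mu = 0, sigma = 1) mu + sigma * tan(pi * (u - 0.5))

set.seed(2024)
u <- runif(10000)
x.logis <- q.logistic(u, mu = 1, beta = 2)    # logistic draws
x.cauchy <- q.cauchy(u, mu = 0, sigma = 1)    # Cauchy draws

# The hand-derived inverses agree with R's built-in quantile functions.
probs <- c(0.1, 0.5, 0.9)
all.equal(q.logistic(probs, 1, 2), qlogis(probs, location = 1, scale = 2))
all.equal(q.cauchy(probs), qcauchy(probs))
```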


General Transform Method

> If $X_i \overset{iid}{\sim} \text{Exp}(1)$, three standard distributions can be derived as

  > $Y = 2 \sum_{i=1}^{n} X_i \sim \chi^2_{2n}$
  > $Y = \beta \sum_{i=1}^{n} X_i \sim \text{Gamma}(n, \beta)$ (with $\beta$ acting as a scale parameter here)
  > $Y = \frac{\sum_{i=1}^{n_1} X_i}{\sum_{i=1}^{n_1 + n_2} X_i} \sim \text{Beta}(n_1, n_2)$

where $n \in \mathbb{N} = \{1, 2, \ldots\}$

> These transformations are quite simple and we will use them quite often

> However, there is a limit to their usefulness. The method is available only when a closed-form CDF is available.

> The method will not work for the Gaussian distribution as the CDF is not in closed form.
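The three transformations can be checked by simulation. This is a sketch with hypothetical integer parameters (`n`, `n1`, `n2`) and a hypothetical scale `beta`; the Exp(1) draws themselves come from `rexp`.

```r
set.seed(7)
N <- 10000                            # number of draws of each kind
n <- 3; n1 <- 2; n2 <- 4; beta <- 0.5 # hypothetical parameter values

# Each row holds the Exp(1) variables used for one draw.
E <- matrix(rexp(N * n), nrow = N)
chisq.draws <- 2 * rowSums(E)         # chi-square with 2n = 6 degrees of freedom
gamma.draws <- beta * rowSums(E)      # Gamma(shape n, scale beta)

E2 <- matrix(rexp(N * (n1 + n2)), nrow = N)
beta.draws <- rowSums(E2[, 1:n1, drop = FALSE]) / rowSums(E2)  # Beta(n1, n2)

# Sample means against theoretical means 2n, n * beta and n1 / (n1 + n2).
c(mean(chisq.draws), mean(gamma.draws), mean(beta.draws))
```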




General Transform Method

> Box-Muller transform to generate from the Normal distribution

> If $U_1$ and $U_2$ are iid from Unif(0, 1)

> The variables $X_1$ and $X_2$

$$X_1 = \sqrt{-2 \log(U_1)} \cos(2 \pi U_2), \quad X_2 = \sqrt{-2 \log(U_1)} \sin(2 \pi U_2)$$

are iid N(0, 1) by virtue of a change of variable argument.

> The Box-Muller algorithm is exact, not a crude CLT-based approximation

> Note that this is not the generator implemented in R. It uses the probability inverse transform with a very accurate representation of the normal cdf
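A minimal sketch of the Box-Muller transform in R follows. The helper name `box.muller` is hypothetical; it generates pairs and trims to the requested length.

```r
# Box-Muller: turn pairs of Unif(0, 1) draws into pairs of N(0, 1) draws.
box.muller <- function(n) {
  m <- ceiling(n / 2)                 # number of uniform pairs needed
  u1 <- runif(m)
  u2 <- runif(m)
  r <- sqrt(-2 * log(u1))             # radius
  z <- c(r * cos(2 * pi * u2), r * sin(2 * pi * u2))
  z[1:n]
}

set.seed(99)
z <- box.muller(10000)
c(mean(z), sd(z))                     # should be close to 0 and 1
```

A quick visual check is `qqnorm(z)`, which should follow a straight line if the draws are standard normal.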




Accept-Reject Method

> There are many distributions where transform methods fail

> For these cases, we must turn to indirect methods
  > We generate a candidate random variable
  > Only accept it subject to passing a test

> This class of 'Accept-Reject' methods is extremely powerful. It will allow us to simulate from virtually any distribution.

> Accept-Reject Methods:
  > Only require the functional form of the density f of interest
  > f: target density, g: candidate density, where it is simpler to simulate random variables from g


Accept-Reject Methods

> The only constraints we impose on this candidate density g
  > f and g have compatible supports (i.e., $g(x) > 0$ when $f(x) > 0$).

> $X \sim f$ can be simulated as follows:
  > Generate $Y \sim g$ and independently generate $U \sim \text{Unif}(0, 1)$
  > If $U \le \frac{1}{M} \frac{f(Y)}{g(Y)}$, set $X = Y$
  > If the inequality is not satisfied then discard Y and U and start again.

> Note $M = \sup_x \frac{f(x)}{g(x)}$

> $P(\text{Accept}) = \frac{1}{M}$ and the expected waiting time is $M$.


Accept-Reject Methods

Accept-Reject Algorithm

1. Generate $Y \sim g$, $U \sim \text{Unif}(0, 1)$
2. Accept $X = Y$ if $U \le f(Y) / (M g(Y))$
3. Return to 1 otherwise


Accept-Reject Methods: Theory

> Why does this method work?

> A straightforward probability calculation shows

$$P(Y \le x \mid \text{Accept}) = P\left(Y \le x \,\middle|\, U \le \frac{f(Y)}{M g(Y)}\right) = P(X \le x),$$

where $Y \sim g$ and $X \sim f$.

> Simulating from g, the output of this algorithm is exactly distributed from f.

> The Accept-Reject method is applicable in any dimension.
  > As long as g is a density over the same space as f.

> Only need to know f/g up to a constant
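The "straightforward probability calculation" can be written out as follows. This is a sketch of the standard argument for the univariate case, writing $F$ for the cdf of $f$ and using $f(y) / (M g(y)) \le 1$, which holds because $M = \sup_x f(x)/g(x)$.

```latex
\begin{align*}
P\left(Y \le x,\ U \le \frac{f(Y)}{M g(Y)}\right)
  &= \int_{-\infty}^{x} P\left(U \le \frac{f(y)}{M g(y)}\right) g(y)\, dy
   = \int_{-\infty}^{x} \frac{f(y)}{M g(y)}\, g(y)\, dy
   = \frac{F(x)}{M}, \\
P(\text{Accept})
  &= \int_{-\infty}^{\infty} \frac{f(y)}{M g(y)}\, g(y)\, dy
   = \frac{1}{M}, \\
P(Y \le x \mid \text{Accept})
  &= \frac{F(x)/M}{1/M} = F(x) = P(X \le x).
\end{align*}
```

The second line also gives the acceptance probability $1/M$, so the number of trials until the first acceptance is geometric with mean $M$.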


Accept-Rejection Algorithm for Beta Distribution

> Generate $X \sim \text{Beta}(a, b)$

> No direct method is available if a and b are not integers

> Generate for a = 2.3 and b = 7.9

> We can generate if a and b are integers


Accept-Rejection Algorithm for Beta Distribution

> Candidate distribution Unif(0, 1)
> Target distribution Beta(2.3, 7.9)

Acceptance Rate

[1] 32.84

[Figure: two histograms on (0, 1), "Histogram of X" (x-axis: X) and "Histogram of candidate density" (x-axis: sample.x).]
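The uniform-candidate case can be sketched as below. This is not the author's code: the bound `M` is found numerically with `optimize`, the number of candidate draws is an arbitrary choice, and the resulting acceptance rate need not match the figure printed on the slide, since the slide's implementation details are not shown.

```r
set.seed(2718)
a <- 2.3; b <- 7.9
f <- function(x) dbeta(x, a, b)        # target density
g <- function(x) dunif(x, 0, 1)        # candidate density

# M = sup f/g, found numerically (the Beta(2.3, 7.9) density is unimodal).
M <- optimize(function(x) f(x) / g(x), interval = c(0, 1),
              maximum = TRUE)$objective

n.try <- 10000                         # number of candidate draws
y <- runif(n.try)                      # candidates Y ~ g
u <- runif(n.try)                      # independent uniforms for the test
accept <- u <= f(y) / (M * g(y))
x <- y[accept]                         # accepted draws, distributed as f

acc.rate <- 100 * mean(accept)         # empirical acceptance rate in percent
c(acc.rate, 100 / M)                   # compare with the theoretical 100 / M
```

Swapping `g` and the candidate draws for `dbeta(x, 2, 7)` and `rbeta(n.try, 2, 7)` would give the Beta-candidate variants shown on the following slides.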


Accept-Rejection Algorithm for Beta Distribution

> Candidate distribution Beta(1, 5)
> Target distribution Beta(2.3, 7.9)

Acceptance Rate

[1] 32.42

[Figure: two histograms, "Histogram of X" (x-axis: x) and "Histogram of candidate density" (x-axis: sample.x).]


Accept-Rejection Algorithm for Beta Distribution

> Candidate distribution Beta(2, 7)
> Target distribution Beta(2.3, 7.9)

Acceptance Rate:

[1] 33.06

[Figure: two histograms, "Histogram of X" (x-axis: X) and "Histogram of candidate density" (x-axis: sample.x).]


A Gateway to all Postgraduate Courses An MHRD project under its National Mission on Education through ICT (NME-ICT) 


Subject: Statistics 


Paper: Statistical Inference 
Module: Monte Carlo Integration and Simulation Technique - Part 2

Development Team


Principal investigator: Dr. Bhaswati Ganguli, Professor, 
Department of Statistics, University of Calcutta 

Paper co-ordinator: Dr. Dipak K Dey, Associate Dean and BOT 
Distinguished Professor, Department of Statistics, 
University of Connecticut 

Content writer: Dr. Sourish Das, Assistant Professor, Chennai 
Mathematical Institute 


Content reviewer: Department of Statistics, University of Calcutta 


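Introduction

> Suppose $Y \sim p(y \mid \theta)$ and $p(\theta)$ is the prior distribution over $\theta$

> Our objective is to make statistical inference about $\theta$

> The posterior distribution is:

$$p(\theta \mid y) = \frac{p(y \mid \theta) p(\theta)}{\int_{\Theta} p(y \mid \theta) p(\theta) \, d\theta}$$

> Objective is to estimate the posterior mean:

$$E(\theta \mid y) = \int_{\Theta} \theta \, p(\theta \mid y) \, d\theta = g(y)$$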

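Monte Carlo Integration

> So we will be concerned with evaluating integrals of the form

$$E(h(\theta) \mid y) = \int h(\theta) \, p(\theta \mid y) \, d\theta$$

> $p(\cdot \mid y)$ is the posterior probability density

> We can generate finitely many random samples $(\theta^1, \ldots, \theta^N)$ from $p(\theta \mid y)$

> Approximate the integral with the sample average

$$\bar{h}_N = \frac{1}{N} \sum_{i=1}^{N} h(\theta^i)$$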

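Monte Carlo Integration

> Convergence:

> CLT:

$$\frac{\bar{h}_N - E(h(\theta) \mid y)}{\sigma_N} \xrightarrow{d} N(0, 1),$$

where $\bar{h}_N = \frac{1}{N} \sum_{k=1}^{N} h(\theta^k)$ is the Monte Carlo average and $\sigma_N^2 = \frac{1}{N^2} \sum_{k=1}^{N} [h(\theta^k) - \bar{h}_N]^2$ (with $\bar{h}_N$ standing in for $E(h(\theta) \mid y)$), such that $\int h^2(\theta) \, p(\theta \mid y) \, d\theta < \infty$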

Monte Carlo Integration

> The advantage of the CLT is that we can evaluate the Monte Carlo error

> It assumes $\sigma_N^2$ is a proper estimate of the variance of $\bar{h}_N$

> If $\sigma_N^2$ does not converge, or converges too slowly, a CLT may not apply. In such cases we will not be able to estimate the Monte Carlo error.
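The Monte Carlo error can be computed alongside the estimate itself. A sketch under assumptions: the "posterior" draws come from a hypothetical Gamma(2, 1) distribution and $h(\theta) = \theta^2$, chosen because $E(h(\theta)) = 6$ is known exactly; the variance estimate is the usual sample-based one, $\sigma_N^2 = \frac{1}{N^2} \sum_k [h(\theta^k) - \bar{h}_N]^2$.

```r
set.seed(11)
N <- 50000
theta <- rgamma(N, shape = 2, rate = 1)   # hypothetical posterior draws
h <- function(th) th^2                    # function whose mean we want

h.vals <- h(theta)
h.bar <- mean(h.vals)                     # Monte Carlo estimate of E(h(theta))
sigma2.N <- sum((h.vals - h.bar)^2) / N^2 # estimated variance of h.bar
mc.se <- sqrt(sigma2.N)                   # Monte Carlo standard error

# Approximate 95% interval for E(h(theta)) based on the CLT.
c(h.bar - 1.96 * mc.se, h.bar + 1.96 * mc.se)
```

Because `mc.se` shrinks at rate $1/\sqrt{N}$, quadrupling the simulation size roughly halves the interval width.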


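Importance Sampling

> Importance sampling is based on an alternative formulation of the SLLN

> For notational convenience we assume $p(\theta \mid y) = f(\theta)$

$$E_f(h(\theta)) = \int h(\theta) f(\theta) \, d\theta = \int h(\theta) \frac{f(\theta)}{g(\theta)} g(\theta) \, d\theta = E_g\left[\frac{h(\theta) f(\theta)}{g(\theta)}\right]$$

> f is the target density
> g is the candidate density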


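Importance Sampling

> So in 'Importance Sampling' all you do is first generate samples from the candidate distribution g

> Suppose $(\theta^1, \theta^2, \ldots, \theta^N)$ are the random samples generated from g

> You estimate

$$\hat{h} = \frac{1}{N} \sum_{i=1}^{N} h(\theta^i) \frac{f(\theta^i)}{g(\theta^i)}$$

and by virtue of the SLLN we have

$$\hat{h} = \frac{1}{N} \sum_{i=1}^{N} h(\theta^i) \frac{f(\theta^i)}{g(\theta^i)} \to E_g\left[\frac{h(\theta) f(\theta)}{g(\theta)}\right] = E_f(h(\theta))$$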

Importance Sampling

> The logic underlying importance sampling lies in a simple rearrangement of terms in the target integral and multiplying by 1:

$$\int h(\theta) f(\theta) \, d\theta = \int h(\theta) \frac{f(\theta)}{g(\theta)} g(\theta) \, d\theta = \int h(\theta) w(\theta) g(\theta) \, d\theta$$

here g() is another density whose support is the same as that of f().

> $w(\theta) = f(\theta) / g(\theta)$ is called the importance function

> A good importance function will be large when the integrand is large and small otherwise.




Importance Sampling (IS)

> One objective of IS is to improve integral approximation: it reduces the variance of an integral approximation.

> Another objective of IS is to draw inference when you cannot generate from the density p

> A third objective of IS is to draw inference or generate samples when the target density is unnormalized




Importance Sampling to improve integral approximation

Improve integral approximation

Consider the function $h(x) = 10 \exp(-2|x - 5|)$. Suppose that we want to calculate $E(h(X))$, where $X \sim \text{Uniform}(0, 10)$. That is, we want to calculate

$$\int_{0}^{10} \exp(-2|x - 5|) \, dx.$$

The true value for this integral is about 1.

> The simple way to do this is to generate $x_i$ from the Uniform(0, 10) and look at the sample mean of $h(x_i)$

> Notice this is equivalent to importance sampling with the candidate density equal to the target, $g(x) = f(x)$, where $f(x) = \frac{1}{10} I(0 < x < 10)$


Importance Sampling to improve integral approximation

> Note that $h(x) = 10 \exp(-2|x - 5|)$ where $x \sim \text{Unif}(0, 10)$
> set.seed(7831)
> sim.size <- 50000
> X <- runif(sim.size, 0, 10)
> Y <- 10*exp(-2*abs(X-5))
> c(mean(Y), var(Y))

[1] 0.999026 3.993018




Importance Sampling to improve integral approximation

> Note that $h(x) = 10 \exp(-2|x - 5|)$ where $x \sim \text{Unif}(0, 10)$

> The function h() in this case is peaked at 5 and decays quickly elsewhere; therefore, under the uniform distribution, many of the points contribute very little to this expectation.

> Something more like a Gaussian function $c e^{-x^2}$ with a peak at 5 and small variance, say, 1, would provide greater precision

> We can re-write the integral as

$$\int_{0}^{10} 10 \exp(-2|x - 5|) \, \frac{1/10}{\frac{1}{\sqrt{2\pi}} e^{-(x - 5)^2 / 2}} \, \frac{1}{\sqrt{2\pi}} e^{-(x - 5)^2 / 2} \, dx$$



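Importance Sampling to improve integral approximation

> We want to estimate $E[h(X) w(X)]$ where $X \sim N(5, 1)$

> So in this case f is the Uniform(0, 10) density and g is the N(5, 1) density

> The weight function is

$$w(x) = \frac{f(x)}{g(x)} = \frac{\frac{1}{10} I(0 < x < 10)}{\frac{1}{\sqrt{2\pi}} e^{-(x - 5)^2 / 2}} = \frac{\sqrt{2\pi}}{10} e^{(x - 5)^2 / 2} I(0 < x < 10)$$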

Importance Sampling to improve integral approximation

> We implement the IS approach as
> w <- function(x) dunif(x, 0, 10)/dnorm(x, mean=5, sd=1)
> h <- function(x) 10*exp(-2*abs(x-5))
> set.seed(7831)
> sim.size <- 50000
> x <- rnorm(sim.size, mean=5, sd=1)
> y <- h(x)*w(x)
> c(mean(y), var(y))

[1] 0.9995830 0.3609514

Note that here the Monte Carlo variance is much smaller than under uniform sampling (about 0.36 against 3.99).


Importance Sampling - Application

Application:
Suppose the daily return of a stock is defined as

$$r_t = \frac{P_t - P_{t-1}}{P_{t-1}} \times 100$$

where $P_t$ is the price of the stock on the $t$-th day. Suppose it is believed that the return of the stock follows $r_t \sim N(0, 1)$.

Objective: We want to estimate $P(r_t < -5)$, i.e., we want to estimate the probability that the price of the stock will drop more than 5% in one day.


Importance Sampling - Application © e UST 


> As we are interested in P(r; < —5) where r ~ N(0, 1), the 
probability in log-scales, i.e., log(P(r; < —5)) 
> log(pnorm(-5,mean-0,sd-1)) 


[1] -15.065 


> Simulating Z(? from N(0, 1) only produces a hit once in 
about 3 million iterations ! 


> This is a very rare event for standard normal distribution 


> Estimates from Importance sampling 
[1] -15.0732 


18 / 27 


                          Built-in R    Importance Sampling
    log(P(r_t < -1))        -1.841          -1.865
    log(P(r_t < -2))        -3.783          -3.803
    log(P(r_t < -3))        -6.608          -6.624
    log(P(r_t < -4))       -10.360         -10.372
    log(P(r_t < -5))       -15.065         -15.073
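
The slides do not state which candidate density produced the importance sampling column. As a minimal sketch, assume the candidate $g$ is $N(-c, 1)$, that is, the standard normal shifted to the threshold $-c$, so that about half of the draws land in the tail. The function name `is.log.tail` and the seed are illustrative, so the numbers will be close to, but not identical with, the table.

```r
# Importance sampling estimate of log P(Z < -cutoff) for Z ~ N(0, 1).
# Assumption: candidate g = N(-cutoff, 1); the slides do not name the candidate.
is.log.tail <- function(cutoff, n = 100000) {
  x <- rnorm(n, mean = -cutoff, sd = 1)               # draws from g
  log.w <- dnorm(x, log = TRUE) -
    dnorm(x, mean = -cutoff, sd = 1, log = TRUE)      # log of f / g
  log(mean((x < -cutoff) * exp(log.w)))               # log of the IS estimate
}

set.seed(2024)
cutoffs <- 1:5
result <- data.frame(
  threshold  = -cutoffs,
  builtin    = log(pnorm(-cutoffs)),          # exact value from pnorm
  importance = sapply(cutoffs, is.log.tail)   # simulated value
)
print(round(result, 3))
```

Working with `log.w` rather than the raw ratio keeps the weights numerically stable when the threshold is far in the tail.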


Importance Sampling
for Bayesian Inference

> Bayesian inference is the primary example where you want
to determine the properties of a probability distribution known only
up to its kernel function.

> Suppose $k(\theta)$ is the kernel function of the posterior
distribution $p(\theta \mid y)$, i.e.,

$$ f(\theta) = p(\theta \mid y) = \frac{k(\theta)}{C}, $$

where $C = \int k(\theta)\,d\theta = \int p(y \mid \theta)\,p(\theta)\,d\theta$.

> Typically we do not know $C$ and we know the posterior
distribution only in its unnormalized form, i.e.,

$$ p(\theta \mid y) \propto p(y \mid \theta)\,p(\theta) = k(\theta) $$


> Suppose we want to calculate $E[h(\theta)]$, where the unnormalized density
of $f(\theta)$ is $k(\theta)$.

> We can rewrite this as

$$ E_f[h(\theta)] = \int h(\theta)\,\frac{k(\theta)}{C}\,d\theta
   = \frac{1}{C}\int h(\theta)\,\frac{k(\theta)}{g(\theta)}\,g(\theta)\,d\theta
   = \int h(\theta)\,w(\theta)\,g(\theta)\,d\theta, $$

where $w(\theta) = \frac{f(\theta)}{g(\theta)} = \frac{k(\theta)}{C\,g(\theta)} = \frac{1}{C}\tilde{w}(\theta)$,
$\tilde{w}(\theta) = k(\theta)/g(\theta)$ is the unnormalized weight, and
$f(\theta) = p(\theta \mid y)$ is the normalized posterior density.

> By the LLN, if $\theta^{1}, \theta^{2}, \ldots, \theta^{N}$ are iid random samples from $g$, then

$$ \frac{1}{N}\sum_{i=1}^{N} h(\theta^{i})\,\tilde{w}(\theta^{i}) \to C\,E[h(\theta)] $$

as $N \to \infty$.

> Also by the LLN,

$$ \frac{1}{N}\sum_{i=1}^{N} \tilde{w}(\theta^{i}) \to C. $$

> Therefore the Monte Carlo estimator for $E[h(\theta)]$ using the
importance sampling scheme is

$$ \hat{h}_{IS} = \frac{\frac{1}{N}\sum_{i=1}^{N} h(\theta^{i})\,\tilde{w}(\theta^{i})}{\frac{1}{N}\sum_{i=1}^{N} \tilde{w}(\theta^{i})} $$


> The method is only reliable when the weights are not too
variable.

> As a rule of thumb, when

$$ ESS = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{\tilde{w}(\theta^{i})}{\bar{w}} - 1\right)^{2} $$

is less than 5, the IS method is reasonable. Note that
$\bar{w} = \frac{1}{N}\sum_{i=1}^{N}\tilde{w}(\theta^{i})$, where
$\tilde{w}(\theta) = k(\theta)/g(\theta)$ is the unnormalized weight. As defined
here, ESS is the squared coefficient of variation of the weights, so smaller
values are better.

> You will know you chose a bad $g$ when ESS is large.


> When ESS < 5 the variance of $\hat{h}_{IS}$ can be estimated as

$$ \hat{\sigma}^{2}_{IS} = \frac{1}{N}\sum_{i=1}^{N}\left(Z_i - \hat{h}_{IS}\right)^{2} $$

where

$$ Z_i = \frac{\tilde{w}(\theta^{i})\,h(\theta^{i})}{\bar{w}} $$


Application - Modeling Road Accident

> Suppose the following data give the number of accidents per
month over the last 12 months on a particular stretch of National
Highway 34 (NH34):

> 6, 2, 2, 1, 2, 1, 1, 2, 3, 5, 2, 1

> We define $y_i$ as the number of accidents in the $i$-th month, i.e.,
$y_1 = 6, y_2 = 2, \ldots, y_{12} = 1$.

> We assume $y_i \overset{iid}{\sim} \text{Poisson}(\theta)$ and
$\theta \sim \text{log-Normal}(\mu = 2, \sigma = 1)$


> In this problem, since a non-conjugate prior has been selected,
the posterior distribution is known only up to a normalizing constant.

> Simulating random samples from the unnormalized posterior
distribution is difficult.

> So we simulate from a candidate distribution and, using the
importance sampling method, we estimate $\theta$ as

    [1] 2.737056
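
The slides do not say which candidate distribution was used. The sketch below assumes a Gamma candidate with shape `sum(y) + 1` and rate `length(y)` (the posterior that a flat prior would give), which keeps the weights bounded. It computes the self-normalized estimate $\hat{h}_{IS}$, the weight-variability measure called ESS in these slides, and the variance estimate. All object names are illustrative, and the output is not expected to reproduce the figure reported above, since the candidate and seed are assumptions.

```r
y <- c(6, 2, 2, 1, 2, 1, 1, 2, 3, 5, 2, 1)

# Log of the posterior kernel k(theta) = p(y | theta) * p(theta),
# with a Poisson likelihood and a log-Normal(2, 1) prior
log.kernel <- function(theta) {
  log.lik <- sum(y) * log(theta) - length(y) * theta - sum(lfactorial(y))
  log.lik + dlnorm(theta, meanlog = 2, sdlog = 1, log = TRUE)
}

set.seed(101)
N <- 50000
shape.g <- sum(y) + 1          # assumed candidate: Gamma(shape.g, rate.g)
rate.g  <- length(y)
theta   <- rgamma(N, shape = shape.g, rate = rate.g)

# Unnormalized weights k / g, computed on the log scale and rescaled
log.w <- log.kernel(theta) - dgamma(theta, shape = shape.g, rate = rate.g, log = TRUE)
w <- exp(log.w - max(log.w))

h.is   <- sum(theta * w) / sum(w)          # self-normalized estimate of E[theta | y]
ess    <- mean((w / mean(w) - 1)^2)        # weight variability (ESS as defined in the slides)
z      <- w * theta / mean(w)
var.is <- mean((z - h.is)^2)               # variance estimate from the slides

print(c(estimate = h.is, ESS = ess, variance = var.is))
```

Rescaling the weights by a constant does not change `h.is` or `ess`, because both depend only on ratios of weights.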


Importance Sampling

> Note that the simulated $(\theta^{1}, \ldots, \theta^{N})$ are not from the target
density; they are from the candidate density $g$.

> IS is an estimation method.

> Later we will see that simulation from the target can be done by
adding a step to IS.


A Gateway to all Postgraduate Courses An MHRD project under its National Mission on Education through ICT (NME-ICT) 


Subject: Statistics 


Paper: Statistical Inference 
Module: Markov Chain Monte Carlo 


Development Team


Principal investigator: Dr. Bhaswati Ganguli, Professor, 
Department of Statistics, University of Calcutta 

Paper co-ordinator: Dr. Dipak K Dey, Associate Dean and BOT 
Distinguished Professor, Department of Statistics, 
University of Connecticut 

Content writer: Dr. Sourish Das, Assistant Professor, Chennai 
Mathematical Institute 


Content reviewer: Department of Statistics, University of Calcutta 


Introduction

> Markov Chain: a stochastic process in which future states are
independent of past states given the present state

> Monte Carlo: simulation technique

> Up until now, we have used Monte Carlo simulation to evaluate
integrals instead of doing so analytically. This process is called
Monte Carlo Integration.


Monte Carlo Integration

> Suppose we have a distribution $f(\theta)$ (perhaps a posterior)
from which we want to draw samples.

> To derive quantities of interest analytically, we need to take integrals:

$$ I = \int_{\Theta} h(\theta)\,f(\theta)\,d\theta $$

where $h(\theta)$ is some function of $\theta$ ($h(\theta) = \theta$ for the mean and
$h(\theta) = (\theta - E(\theta))^2$ for the variance).

> We can approximate the integrals via Monte Carlo Integration
by simulating $M$ values from $f(\theta)$ and calculating

$$ \hat{I}_M = \frac{1}{M}\sum_{i=1}^{M} h(\theta^{(i)}) $$


> For example, we can compute the expected value of the
Beta(3, 4) distribution analytically:

$$ E(\theta) = \int_0^1 \theta\,f(\theta)\,d\theta
   = \int_0^1 \theta\,\frac{\Gamma(7)}{\Gamma(3)\Gamma(4)}\,\theta^{3-1}(1-\theta)^{4-1}\,d\theta
   = \frac{3}{7} = 0.4286 $$

or via Monte Carlo methods:

    > set.seed(88732)
    > M <- 50000
    > beta.sims <- rbeta(M, 3, 4)
    > sum(beta.sims) / M

    [1] 0.4288105
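
The same draws can also approximate the variance by taking $h(\theta) = (\theta - E(\theta))^2$. The sketch below is illustrative (the object names `mc.var` and `exact.var` are not from the slides) and compares the Monte Carlo value with the closed-form Beta variance $ab / ((a+b)^2 (a+b+1))$.

```r
set.seed(88732)
M <- 50000
beta.sims <- rbeta(M, 3, 4)

# Monte Carlo estimate of the mean, then of the variance via
# h(theta) = (theta - E(theta))^2
mc.mean <- mean(beta.sims)
mc.var  <- mean((beta.sims - mc.mean)^2)

# Closed-form variance of Beta(a, b): a * b / ((a + b)^2 * (a + b + 1))
exact.var <- (3 * 4) / ((3 + 4)^2 * (3 + 4 + 1))

print(c(monte.carlo = mc.var, exact = exact.var))
```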


> Our Monte Carlo approximation $\hat{I}_M$ is a simulation estimator
such that $\hat{I}_M \to I$ as $M \to \infty$.

This follows from the SLLN.


Strong Law of Large Numbers (SLLN)

> Let $X_1, X_2, \ldots$ be a sequence of independent and identically
distributed random variables, each having a finite mean
$\mu = E(X_i)$.
Then with probability 1,

$$ \frac{X_1 + X_2 + \cdots + X_M}{M} \to \mu \quad \text{as } M \to \infty $$

> In the example, each simulated random sample was
independent and distributed from the same Beta(3, 4)
distribution.

> But what if we cannot generate draws that are independent?


Failing to Draw Independent Random Samples

> Suppose we want to draw from our posterior distribution
$p(\theta \mid y)$, but we cannot draw independent samples from it.

> For example, we often do not know the normalizing constant.

> We may be able to sample draws from $p(\theta \mid y)$ that are slightly
dependent.

> If we can draw slightly dependent samples using a Markov
chain, can we still evaluate the integral?


What is a Markov Chain?

> Definition: A Markov chain is a stochastic process in which
future states are independent of past states given the present
state.

> Stochastic process: a consecutive set of random (not
deterministic) quantities defined on some known state space
$\Theta$.

> Think of $\Theta$ as our parameter space.

> Consecutive implies a time component, indexed by $t$.


> Consider a draw of $\theta^{(t)}$ to be a state at iteration $t$.

> The next draw $\theta^{(t+1)}$ is dependent only on the current draw
$\theta^{(t)}$, and not on any past draws.

> Markov Property:

$$ p(\theta^{(t+1)} \mid \theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(t)}) = p(\theta^{(t+1)} \mid \theta^{(t)}) $$


> So our Markov chain is a sequence of draws of $\theta$ that are each
dependent on the previous draw.

> The chain wanders around the parameter space, remembering
only where it was in the previous period.

> What are the rules that govern how the chain jumps from one
state to another at each period?

> The rules that govern the jumps are known as the transition
kernel.


Transition Kernel

> For a discrete state space ($k$ possible states): a $k \times k$ matrix of
transition probabilities.

> Example: Suppose $k = 3$. The $3 \times 3$ transition matrix $P$
would be

$$
P = \begin{pmatrix}
p(\theta_1^{(t+1)} \mid \theta_1^{(t)}) & p(\theta_2^{(t+1)} \mid \theta_1^{(t)}) & p(\theta_3^{(t+1)} \mid \theta_1^{(t)}) \\
p(\theta_1^{(t+1)} \mid \theta_2^{(t)}) & p(\theta_2^{(t+1)} \mid \theta_2^{(t)}) & p(\theta_3^{(t+1)} \mid \theta_2^{(t)}) \\
p(\theta_1^{(t+1)} \mid \theta_3^{(t)}) & p(\theta_2^{(t+1)} \mid \theta_3^{(t)}) & p(\theta_3^{(t+1)} \mid \theta_3^{(t)})
\end{pmatrix}
$$

where the subscripts index the 3 possible values that $\theta$ can
take.

> The rows sum to one and define a conditional probability mass
function, conditional on the current state.

> Each column collects the probabilities of moving into a
certain state in the next period, one for each current state.


Markov Chain

Discrete example:

> Define a starting distribution $\Pi^{(0)}$ (a $1 \times k$ vector of
probabilities that sum to one).

> At iteration 1, our distribution $\Pi^{(1)}$ (from which $\theta^{(1)}$ is drawn)
is

$$ \Pi^{(1)}_{(1 \times k)} = \Pi^{(0)}_{(1 \times k)} \times P_{(k \times k)} $$

> At iteration 2, our distribution $\Pi^{(2)}$ (from which $\theta^{(2)}$ is drawn)
is

$$ \Pi^{(2)}_{(1 \times k)} = \Pi^{(1)}_{(1 \times k)} \times P_{(k \times k)} $$

> At iteration $t$, our distribution $\Pi^{(t)}$ (from which $\theta^{(t)}$ is drawn)
is

$$ \Pi^{(t)} = \Pi^{(t-1)} \times P = \Pi^{(0)} \times P^{t} $$
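
The recursion for the state distribution can be sketched numerically. The 3-state transition matrix below is hypothetical (it does not come from the slides), as are the object names. Starting from a point mass on state 1, the distribution is pushed forward one step at a time.

```r
# Hypothetical 3-state transition matrix
# (rows: current state, columns: next state)
P <- matrix(c(0.5, 0.3, 0.2,
              0.2, 0.6, 0.2,
              0.1, 0.3, 0.6), nrow = 3, byrow = TRUE)
stopifnot(all(abs(rowSums(P) - 1) < 1e-12))   # each row is a conditional pmf

Pi <- c(1, 0, 0)                              # starting distribution: state 1
history <- matrix(NA, nrow = 21, ncol = 3)
history[1, ] <- Pi
for (t in 1:20) {
  Pi <- as.vector(Pi %*% P)                   # distribution after one more step
  history[t + 1, ] <- Pi
}

# Distribution at iterations 0, 1, 2, 5, 10 and 20
print(round(history[c(1, 2, 3, 6, 11, 21), ], 4))
```

For this particular matrix the rows of `history` settle quickly, which previews the idea of a stationary distribution.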


Stationarity of Markov Chain

> Define a stationary distribution $\pi$ to be some distribution $\Pi$
such that $\pi = \pi P$.

> Here we blur the distinction between the Markov chain and its
transition matrix $P$.

> If the posterior distribution is proper, then the Markov chain of
the MCMC algorithm will typically converge to $p(\theta \mid y)$
regardless of our starting point.

> So if we can devise a Markov chain whose stationary
distribution $\Pi$ is our desired posterior distribution $p(\theta \mid y)$, then
we can run this chain to get draws that are approximately from
$p(\theta \mid y)$ once the chain has converged.


Fundamental Theorem of Markov Chain

If a Markov chain $P$ is irreducible, aperiodic and positive recurrent
then it has a unique stationary distribution $\pi$.

> This is the unique (normalized such that the entries sum to 1)
left eigenvector of $P$ with eigenvalue 1.

> In addition,

$$ P^{t}(x, y) \to \pi(y) \quad \text{as } t \to \infty \text{ for all } x, y \in \Omega. $$
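
Both statements can be checked numerically on a small example. The sketch below uses a hypothetical 3-state transition matrix (not from the slides). It extracts the left eigenvector of $P$ with eigenvalue 1 via `eigen(t(P))` and compares it with the rows of a high power of $P$.

```r
# Hypothetical 3-state transition matrix (an assumption for illustration)
P <- matrix(c(0.5, 0.3, 0.2,
              0.2, 0.6, 0.2,
              0.1, 0.3, 0.6), nrow = 3, byrow = TRUE)

# Left eigenvectors of P are the (right) eigenvectors of t(P)
e    <- eigen(t(P))
idx  <- which.min(abs(e$values - 1))      # position of eigenvalue 1
stat <- Re(e$vectors[, idx])
stat <- stat / sum(stat)                  # normalize so the entries sum to 1

# P^50: every row should be close to the stationary distribution
Pt <- diag(3)
for (t in 1:50) Pt <- Pt %*% P

print(round(stat, 4))
print(round(Pt, 4))
```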


Burn-in

> In light of the Fundamental Theorem of Markov Chain, we shall refer to
an irreducible, aperiodic, positive recurrent Markov chain as ergodic.

> So as long as we can simulate an ergodic Markov chain,
the chain will converge to the target stationary distribution
irrespective of the starting point.

> However, the time it takes for the chain to converge may vary
depending on the starting point.

> As a best practice, it is advisable to throw out a certain
number of the first draws, known as the burn-in.


Monte Carlo Integration on Markov Chain

> Once we have a Markov chain that has converged to the
stationary distribution, the draws in our chain behave
like draws from $p(\theta \mid y)$, so it seems we should be able
to use Monte Carlo Integration methods to find quantities of
interest.

> One problem: Markov chain draws are not independent, which we
required for Monte Carlo Integration to work (a required
condition for the SLLN).

> The answer is the Ergodic Theorem.


Ergodic Theorem

Let $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(T)}$ be $T$ values from a Markov chain that is
aperiodic, irreducible and positive recurrent (that is, the chain is
ergodic), and suppose $E[g(\theta)] < \infty$. Then with probability 1

$$ \frac{1}{T}\sum_{t=1}^{T} g(\theta^{(t)}) \to \int_{\Theta} g(\theta)\,\pi(\theta)\,d\theta $$

as $T \to \infty$, where $\pi$ is the stationary distribution.

> This is the Markov chain analog to the LLN.

> But what does it mean for a chain to be aperiodic, irreducible,
and positive recurrent?
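
A small simulation illustrates the theorem. The sketch assumes a hypothetical 3-state transition matrix (not from the slides) whose stationary distribution works out to $(5/21, 9/21, 7/21)$. Taking $g$ to be the indicator of each state, the time averages along one dependent chain should approach those probabilities. The burn-in length of 1000 is an arbitrary illustrative choice.

```r
set.seed(42)

# Hypothetical 3-state transition matrix (an assumption for illustration)
P <- matrix(c(0.5, 0.3, 0.2,
              0.2, 0.6, 0.2,
              0.1, 0.3, 0.6), nrow = 3, byrow = TRUE)

n.iter <- 100000
state <- integer(n.iter)
state[1] <- 1                                  # arbitrary starting state
for (t in 2:n.iter) {
  # next state depends only on the current state (Markov property)
  state[t] <- sample(1:3, size = 1, prob = P[state[t - 1], ])
}

burned <- state[-(1:1000)]                     # discard the burn-in draws

# Time averages of the indicator functions g(theta) = 1{theta = j}
time.avg <- as.vector(table(factor(burned, levels = 1:3))) / length(burned)

print(rbind(time.average = time.avg, stationary = c(5, 9, 7) / 21))
```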


Aperiodicity

> A Markov chain $P$ is aperiodic if for all $x, y \in \Omega$ we have

$$ \gcd\{t : P^{t}(x, y) > 0\} = 1. $$

> If the only length of time for which the chain repeats some
cycle of values is the trivial case with cycle length equal to
one, then the chain is aperiodic.

> Intuitively, repetition is allowed, but as long as the
chain is not repeating itself in an identical cycle, the
chain is aperiodic.


Irreducibility

> A Markov chain is irreducible if it is possible to go from any
state to any other state (not necessarily in one step).

> A Markov chain $P$ is irreducible if for all $x, y$ there exists
some $t$ such that $P^{t}(x, y) > 0$.

> We can show that if $P$ is irreducible, then $P$ is aperiodic
$\iff$ there exists $t$ such that $P^{t}(x, y) > 0$ for all $x, y \in \Omega$.


Positive Recurrence

> A Markov chain is recurrent if for any given state $i$, if the
chain starts at $i$, it will eventually return to $i$ with probability
1.

> A Markov chain is positive recurrent if the expected return
time to state $i$ is finite; otherwise it is null recurrent.

> So if our Markov chain is aperiodic, irreducible, and positive
recurrent, then it is ergodic and the ergodic theorem allows us
to do Monte Carlo Integration by calculating $E[g(\theta)]$ from
draws, ignoring the dependence between draws.
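
For a finite state space, the irreducibility and aperiodicity conditions can be checked together by looking for a power $t$ with $P^{t}(x, y) > 0$ for all $x, y$. The sketch below does this for two hypothetical matrices: one with all transitions possible and one deterministic 3-cycle. The helper names `mat.power` and `has.positive.power` and the search limit `max.t` are illustrative assumptions; a finite search of this kind is a heuristic check, not a proof.

```r
# t-th power of a square matrix by repeated multiplication
mat.power <- function(P, t) {
  out <- diag(nrow(P))
  for (i in seq_len(t)) out <- out %*% P
  out
}

# TRUE if some power P^t (t <= max.t) has all entries strictly positive
has.positive.power <- function(P, max.t = 50) {
  for (t in 1:max.t) {
    if (all(mat.power(P, t) > 0)) return(TRUE)
  }
  FALSE
}

# Hypothetical chain in which every state can reach every other state
P.good <- matrix(c(0.5, 0.3, 0.2,
                   0.2, 0.6, 0.2,
                   0.1, 0.3, 0.6), nrow = 3, byrow = TRUE)

# Deterministic cycle 1 -> 2 -> 3 -> 1: irreducible but periodic (period 3)
P.cycle <- matrix(c(0, 1, 0,
                    0, 0, 1,
                    1, 0, 0), nrow = 3, byrow = TRUE)

print(has.positive.power(P.good))    # TRUE
print(has.positive.power(P.cycle))   # FALSE
```

The cycle chain visits every state, so it is irreducible, yet no single power of its matrix is strictly positive because it returns to each state only at multiples of 3.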


What is MCMC?

> MCMC is a class of Monte Carlo methods in which we
simulate dependent samples that are approximately from a
posterior probability distribution.

> We then take the draws and calculate quantities of interest for
the posterior distribution.

> In Bayesian statistics, there are generally two MCMC
algorithms that we use: the Gibbs Sampler and the
Metropolis-Hastings algorithm.


Application

> DAX and FTSE are two stock market indexes for the German and
UK stock exchanges, respectively.

> Suppose $P_t^{D}$ is the DAX value on the $t$-th day

> and $P_t^{F}$ is the FTSE value on the $t$-th day

> The corresponding log-returns are

> $r_t^{D} = \log(P_t^{D}) - \log(P_{t-1}^{D})$
> $r_t^{F} = \log(P_t^{F}) - \log(P_{t-1}^{F})$


> We want to model the relationship as

$$ r_t^{F} = \alpha + \beta\,r_t^{D} + \epsilon $$

where $\epsilon \sim N(0, \sigma^2)$

> $\alpha > 0$ means FTSE is undervalued compared to DAX
> $\alpha < 0$ means FTSE is overvalued compared to DAX
> $\alpha = 0$ means FTSE is fairly valued compared to DAX

> $\beta > 1$ means the systematic risk of FTSE is more than that of DAX
> $\beta < 1$ means the systematic risk of FTSE is less than that of DAX
> $\beta = 1$ means the systematic risk of FTSE is equal to that of DAX


> Prior:
> $\alpha \sim N(0, \sigma_\alpha^2)$
> $\beta \sim N(1, \sigma_\beta^2)$
> $\sigma^{-2} \sim \text{Gamma}(c_0/2, d_0/2)$

    > library(MCMCpack)
    > # b0: prior means of (alpha, beta); B0: prior precision
    > posterior <- MCMCregress(FTSE ~ DAX,
    +                          b0 = c(0, 1),
    +                          B0 = 0.1,
    +                          sigma.mu = 5,
    +                          sigma.var = 25,
    +                          data = log_return,
    +                          verbose = 0)


[Figure: trace plots (iterations 2000 to 10000) and posterior density plots (N = 10000 draws) for alpha, beta and sigma.sq]

Summary

                  2.5%       25%       50%       75%     97.5%
    alpha     -0.00465  -0.00152   0.00013   0.00174   0.00487
    beta       0.03430   0.33824   0.49926   0.65432   0.94987
    sigma.sq   0.01011   0.01054   0.01078   0.01102   0.01149

> The 95% CI of $\alpha$ indicates that FTSE is fairly valued compared to
DAX, since the interval includes 0.

> The 95% CI of $\beta$ indicates that the systematic risk of FTSE is less
than that of DAX, since the interval does not include 1; more
precisely,

$$ P(\beta < 1 \mid \text{Data}) = \frac{1}{N}\sum_{i=1}^{N} I(\beta^{(i)} < 1) $$

    > beta <- data.frame(posterior)$beta
    > length(beta[beta < 1]) / length(beta)

    [1] 0.985
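
MCMCpack documents `MCMCregress` as a Gibbs sampler. To show what such a sampler does, here is a minimal base-R sketch for the same model form, alternating between the normal full conditional of $(\alpha, \beta)$ and the Gamma full conditional of $\sigma^{-2}$. Everything specific in it is an assumption for illustration: the data are simulated stand-ins (not the DAX/FTSE returns), and the true coefficients, hyperparameters `c0` and `d0`, chain length and burn-in are arbitrary choices. It is not the package's implementation.

```r
set.seed(123)

# Simulated stand-in data (hypothetical): y = 0 + 0.5 * x + noise
n <- 200
x <- rnorm(n)
y <- 0 + 0.5 * x + rnorm(n, sd = 1)
X <- cbind(1, x)

# Priors: (alpha, beta) ~ N(b0, B0^{-1}), sigma^{-2} ~ Gamma(c0/2, d0/2)
b0 <- c(0, 1)
B0 <- diag(0.1, 2)          # prior precision matrix
c0 <- 0.001
d0 <- 0.001

n.iter <- 5000
burn   <- 1000
draws  <- matrix(NA, n.iter, 3,
                 dimnames = list(NULL, c("alpha", "beta", "sigma.sq")))

XtX <- crossprod(X)
Xty <- crossprod(X, y)
sigma.sq <- 1               # starting value

for (s in 1:n.iter) {
  # (alpha, beta) | sigma.sq, y ~ N(m, V)
  V <- solve(B0 + XtX / sigma.sq)
  m <- V %*% (B0 %*% b0 + Xty / sigma.sq)
  coefs <- as.vector(m + t(chol(V)) %*% rnorm(2))

  # sigma^{-2} | alpha, beta, y ~ Gamma((c0 + n)/2, (d0 + SSR)/2)
  ssr <- sum((y - X %*% coefs)^2)
  sigma.sq <- 1 / rgamma(1, shape = (c0 + n) / 2, rate = (d0 + ssr) / 2)

  draws[s, ] <- c(coefs, sigma.sq)
}

kept <- draws[-(1:burn), ]                     # discard burn-in
print(apply(kept, 2, quantile, probs = c(0.025, 0.5, 0.975)))
prob.beta.lt.1 <- mean(kept[, "beta"] < 1)     # estimate of P(beta < 1 | Data)
print(prob.beta.lt.1)
```

The last line mirrors the calculation above: the posterior probability is the fraction of retained draws of `beta` that fall below 1.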




A Gateway to all Postgraduate Courses An MHRD project under its National Mission on Education through ICT (NME-ICT) 


Subject: Statistics 


Paper: Statistical Inference 


Module: Bayesian Regression Analysis - Part 1 


Development Team


Principal investigator: Dr. Bhaswati Ganguli, Professor, 
Department of Statistics, University of Calcutta 

Paper co-ordinator: Dr. Dipak K Dey, Associate Dean and BOT 
Distinguished Professor, Department of Statistics, 
University of Connecticut 

Content writer: Dr. Sourish Das, Assistant Professor, Chennai 
Mathematical Institute 


Content reviewer: Department of Statistics, University of Calcutta 


Introduction

> Bayesian Regression with conjugate and non-conjugate priors

> Set-up of the Bayesian Regression Model

> The standard "improper" non-informative prior

> Conjugate Prior Analysis


Multiple Linear Regression

> Many studies concern the relationship between two or more observable quantities.

> How does a quantity $y$ (the dependent variable) vary as a function of another quantity $x$ (the independent variable)?

> Regression models allow us to examine the conditional distribution of $y$ given $x$, parameterized as $p(y \mid \beta, x)$, when the $n$ observations $(y_i, x_i)$ are exchangeable.


> The normal linear model occurs when the distribution of $y$ given $X$ is normal with a mean equal to a linear function of $X$:

$$E(y_i \mid \beta, X) = \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki}$$

for $i \in \{1, 2, \dots, n\}$, where $X_1$ is a vector of ones.

> The ordinary linear regression model occurs when the variance of $y$ given $X, \beta$ is assumed to be constant over all observations.

> In other words, we have an ordinary linear regression model when:

$$y_i \sim N(\beta_1 + \beta_2 X_{2i} + \dots + \beta_k X_{ki}, \sigma^2)$$

for $i \in \{1, 2, \dots, n\}$.


> If $y_i \sim N(\beta_1 + \beta_2 X_{2i} + \dots + \beta_k X_{ki}, \sigma^2)$ then it is well known that the ordinary least squares estimates and the maximum likelihood estimates of the parameters $\beta_1, \dots, \beta_k$ are equivalent.

> If $\beta = [\beta_1, \dots, \beta_k]^T$ then the frequentist estimate of $\beta$ is

$$\hat\beta = (X^T X)^{-1} X^T y$$

> We know by the Gauss-Markov theorem that this estimate is BLUE (the best linear unbiased estimator).

> The frequentist estimate of $\sigma^2$ is

$$s^2 = (y - X\hat\beta)^T (y - X\hat\beta)/(n - k)$$
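The two estimates above can be sketched in R as follows. This is a minimal illustration on simulated data: the sample size, the design matrix and the "true" coefficients are assumptions made for the example, not values from the lecture. The result is cross-checked against R's built-in `lm()`.

```r
# Simulated data (illustrative assumptions): n observations, k = 3 coefficients
set.seed(1)
n <- 50
k <- 3
X <- cbind(1, rnorm(n), rnorm(n))        # first column is the vector of ones
beta_true <- c(1, 2, -0.5)
y <- as.vector(X %*% beta_true + rnorm(n, sd = 0.7))

# OLS / maximum likelihood estimate: beta_hat = (X'X)^{-1} X'y
XtX <- crossprod(X)                       # X'X
beta_hat <- as.vector(solve(XtX, crossprod(X, y)))

# Frequentist variance estimate: s^2 = RSS / (n - k)
res <- y - as.vector(X %*% beta_hat)
s2 <- sum(res^2) / (n - k)

# Cross-check with lm(); "- 1" drops lm's own intercept because X already has one
fit <- lm(y ~ X - 1)
print(cbind(normal_equations = beta_hat, lm = unname(coef(fit))))
```

Solving the normal equations with `solve(XtX, ...)` avoids forming the explicit inverse, which is usually the numerically safer choice.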


> The uncertainty about the quantities $\beta$ is summarized by the standard errors of the regression coefficients, which are the square roots of the diagonal elements of the matrix $\mathrm{Var}(\hat\beta) = \sigma^2 (X^T X)^{-1}$, with $\sigma^2$ estimated by $s^2$.

> Finally, we know that if $V_i$ is the $i$-th diagonal element of $(X^T X)^{-1}$, so that the estimated variance of $\hat\beta_i$ is $s^2 V_i$, then

$$(\hat\beta_i - \beta_i)/\sqrt{s^2 V_i} \sim t_{n-k}$$

> This statistic forms the basis for our hypothesis tests.
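As a rough sketch of how the standard errors and the t statistic are computed, the R code below takes $V_i$ to be the $i$-th diagonal element of $(X^T X)^{-1}$ and tests the hypothesis $\beta_i = 0$. The simulated data and all variable names are assumptions made for illustration only.

```r
# Simulated data (illustrative assumptions)
set.seed(2)
n <- 40
X <- cbind(1, rnorm(n))                   # intercept and one covariate
y <- as.vector(X %*% c(0.5, 1.5) + rnorm(n))
k <- ncol(X)

XtX_inv <- solve(crossprod(X))            # (X'X)^{-1}
beta_hat <- as.vector(XtX_inv %*% crossprod(X, y))
s2 <- sum((y - X %*% beta_hat)^2) / (n - k)

V <- diag(XtX_inv)                        # V_i: diagonal elements of (X'X)^{-1}
se <- sqrt(s2 * V)                        # standard errors of the coefficients
t_stat <- (beta_hat - 0) / se             # t statistics for H0: beta_i = 0
p_val <- 2 * pt(abs(t_stat), df = n - k, lower.tail = FALSE)

print(cbind(estimate = beta_hat, std_error = se, t = t_stat, p = p_val))
```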


Bayesian Setup

> For the normal linear model, we have:

$$y_i \sim N(\mu_i, \sigma^2)$$

for $i \in \{1, 2, 3, \dots, n\}$, where $\mu_i$ is just shorthand for the expression:

$$\mu_i = \beta_1 X_{1i} + \dots + \beta_k X_{ki}$$

> The object of statistical inference is the posterior distribution of the parameters $\beta_1, \dots, \beta_k$ and $\sigma^2$.

> By Bayes' rule, we know that this is simply:

$$p(\beta_1, \dots, \beta_k, \sigma^2 \mid y, X) \propto \prod_{i=1}^{n} p(y_i \mid \mu_i, \sigma^2)\, p(\beta_1, \dots, \beta_k, \sigma^2)$$


Bayesian Regression with Noninformative Prior

> By Bayes' rule, the posterior distribution is:

$$p(\beta_1, \dots, \beta_k, \sigma^2 \mid y, X) \propto \prod_{i=1}^{n} p(y_i \mid \mu_i, \sigma^2)\, p(\beta_1, \dots, \beta_k, \sigma^2)$$

> To make inferences about the regression coefficients, we need to choose a prior distribution for $\beta, \sigma^2$.

> The standard non-informative prior distribution is uniform on $(\beta, \log \sigma^2)$, which is equivalent to:

$$p(\beta, \sigma^2) \propto \sigma^{-2}$$

> This prior is a good choice for statistical models when you have a lot of data points and only a few parameters.

> Why?

> Because if you have a lot of data and few parameters, then the likelihood function is very sharply peaked, which means that the likelihood (the data) will dominate posterior inferences.

> With small sample sizes or a lot of parameters, prior distributions or hierarchical models become more important for analysis.
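The equivalence between a flat prior on $(\beta, \log \sigma^2)$ and a prior proportional to $\sigma^{-2}$ on $(\beta, \sigma^2)$ is a change of variables. A short sketch, writing $\tau = \log \sigma^2$ (the symbol $\tau$ is introduced here only for this derivation):

```latex
% Flat prior on (beta, tau), with tau = log(sigma^2)
p(\beta, \tau) \propto 1
% Change of variables from tau to sigma^2; the Jacobian is d tau / d sigma^2 = 1 / sigma^2
p(\beta, \sigma^2)
  = p(\beta, \tau)\,\left| \frac{d\tau}{d\sigma^2} \right|
  \propto 1 \cdot \frac{1}{\sigma^2}
  = \sigma^{-2}
```

Because $\int_0^\infty \sigma^{-2}\, d\sigma^2$ diverges, this prior is improper, which is why it is called the standard "improper" non-informative prior.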


> If $y \sim N(X\beta, \sigma^2 I)$ and $p(\beta, \sigma^2) \propto \sigma^{-2}$, then the conditional posterior distribution is

$$\beta \mid \sigma^2, y, X \sim MN_k(\hat\beta, \sigma^2 (X^T X)^{-1})$$

where $\hat\beta = (X^T X)^{-1} X^T y$.

> This follows by completing the square, as we have seen repeatedly when dealing with the normal distribution.

> The posterior distribution of $\sigma^2$ can be written as:

$$\sigma^2 \mid y, X \sim \text{Scaled-Inv-}\chi^2(n - k, s^2)$$

where $s^2 = (y - X\hat\beta)^T (y - X\hat\beta)/(n - k)$.
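For readers who want the "completing the square" step spelled out, a sketch of the argument is given below. It uses only the definitions above; the cross term vanishes because the OLS residuals are orthogonal to the columns of $X$.

```latex
% Decompose the quadratic form in the likelihood around beta-hat
(y - X\beta)^T (y - X\beta)
  = (y - X\hat\beta)^T (y - X\hat\beta)
  + (\beta - \hat\beta)^T X^T X (\beta - \hat\beta)
% The cross term 2 (\hat\beta - \beta)^T X^T (y - X\hat\beta) is zero,
% since X^T (y - X\hat\beta) = X^T y - X^T X \hat\beta = 0.

% Hence, as a function of beta for fixed sigma^2,
p(\beta \mid \sigma^2, y, X)
  \propto \exp\!\left\{ -\frac{1}{2\sigma^2}
      (\beta - \hat\beta)^T X^T X (\beta - \hat\beta) \right\}
% which is the kernel of MN_k(\hat\beta, \sigma^2 (X^T X)^{-1}).
```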


> Integrating $\sigma^2$ out of the full posterior distribution, the marginal posterior distribution of $\beta$ follows a multivariate t-distribution, i.e.,

$$\beta \mid y, X \sim t_{n-k}(\hat\beta, s^2 (X^T X)^{-1})$$

> Notice the close correspondence with the classical results. The key difference is the interpretation of the standard errors.
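A minimal R sketch of sampling from this posterior by composition: first draw $\sigma^2$ from its scaled inverse chi-square marginal, then draw $\beta$ from the conditional multivariate normal. The simulated data, the number of draws and all variable names are assumptions made for illustration. The draws of $\beta$ obtained this way are draws from the multivariate t marginal.

```r
# Simulated data (illustrative assumptions)
set.seed(3)
n <- 30
X <- cbind(1, rnorm(n))
y <- as.vector(X %*% c(0, 0.5) + rnorm(n, sd = 0.1))
k <- ncol(X)

XtX_inv <- solve(crossprod(X))
beta_hat <- as.vector(XtX_inv %*% crossprod(X, y))
s2 <- sum((y - X %*% beta_hat)^2) / (n - k)

n_draws <- 5000

# Step 1: sigma^2 | y ~ Scaled-Inv-chi^2(n - k, s^2), i.e. (n - k) s^2 / chi^2_{n-k}
sigma2 <- (n - k) * s2 / rchisq(n_draws, df = n - k)

# Step 2: beta | sigma^2, y ~ MN_k(beta_hat, sigma^2 (X'X)^{-1})
# With U'U = (X'X)^{-1} and z ~ N(0, I_k), beta_hat + sigma * U'z has this law.
U <- chol(XtX_inv)                                   # upper triangular factor
Z <- matrix(rnorm(n_draws * k), nrow = n_draws)      # one standard normal row per draw
beta <- sweep((Z %*% U) * sqrt(sigma2), 2, beta_hat, "+")

# Posterior summaries: quantiles per coefficient and P(slope < 1 | data)
ci <- apply(beta, 2, quantile, probs = c(0.025, 0.5, 0.975))
post_prob <- mean(beta[, 2] < 1)
print(ci)
```

The last line of the summaries estimates a posterior probability as the proportion of draws satisfying the event, which is the same Monte Carlo idea used in the application.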


Bayesian Model Evaluation

> Model evaluation proceeds just as it would in the traditional OLS framework. Models are largely evaluated based on their fitted (predicted) values.

> For example, [equation missing in the source], where $\hat{y}_i$ is the fitted value of $y$.

> In this respect, Bayesian model evaluation is straightforward.

> The fitted value for an observation $i$ is a random draw from a t distribution with mean $X_i \hat\beta$ and squared scale $s^2 (1 + X_i (X^T X)^{-1} X_i^T)$ on $(n - k)$ degrees of freedom.


> In most cases, you can simply use $X\hat\beta$ as the fitted value and ignore this additional uncertainty.

> Other diagnostics that you should use are plots of residuals versus covariates, histograms of the residuals, etc., to make sure that the assumptions of the normal linear model hold.
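The predictive t distribution for a fitted value can be sketched in R as below. The simulated data and the covariate row `x_new` are hypothetical, chosen only to illustrate the formula $s^2 (1 + X_i (X^T X)^{-1} X_i^T)$ for the squared scale.

```r
# Simulated data (illustrative assumptions)
set.seed(5)
n <- 30
X <- cbind(1, rnorm(n))
y <- as.vector(X %*% c(1, 2) + rnorm(n, sd = 0.5))
k <- ncol(X)

XtX_inv <- solve(crossprod(X))
beta_hat <- as.vector(XtX_inv %*% crossprod(X, y))
s2 <- sum((y - X %*% beta_hat)^2) / (n - k)

x_new <- c(1, 0.3)                          # hypothetical covariate row
loc <- sum(x_new * beta_hat)                # location: x_new' beta_hat
scale2 <- s2 * (1 + as.numeric(t(x_new) %*% XtX_inv %*% x_new))

# Draws from the predictive distribution: location + scale * t_{n-k} variate
y_rep <- loc + sqrt(scale2) * rt(10000, df = n - k)

# 95% predictive interval from the closed form
pred_int <- loc + sqrt(scale2) * qt(c(0.025, 0.975), df = n - k)
print(pred_int)
```

Using `loc` alone as the fitted value corresponds to ignoring the additional uncertainty captured by `scale2`.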


Comment on the regression setup

There are two subtle points regarding the Bayesian regression setup.

> First, a full Bayesian model includes a distribution for the independent variable $X$, $p(X \mid \psi)$. Therefore we have a joint likelihood $p(X, y \mid \psi, \beta, \sigma)$ and a joint prior $p(\psi, \beta, \sigma)$. The fundamental assumption of the normal linear model is that $\psi$ and $(\beta, \sigma)$ are independent in their prior distributions, so that the posterior distribution factors into:

$$p(\psi, \beta, \sigma \mid X, y) = p(\beta, \sigma \mid X, y)\, p(\psi \mid X, y).$$

As a result, $p(\beta, \sigma \mid X, y) \propto p(\beta, \sigma)\, p(y \mid \beta, \sigma, X)$.


> Second, when we set up our probability model, we are implicitly conditioning on a model, call it $H$, which represents our beliefs about the data-generating process. Thus,

$$p(\beta, \sigma \mid X, y, H) \propto p(\beta, \sigma \mid H)\, p(y \mid \beta, \sigma, X, H)$$

> It is important to keep in mind that our inferences are dependent on $H$, and this is equally true for the frequentist perspective, where results can depend on the choice of likelihood function, covariates, etc.


Conjugate priors and the normal linear model

> Suppose that instead of an improper prior, we decide to use the conjugate prior.

> For the normal regression model, the conjugate prior distribution for $(\beta_1, \dots, \beta_k, \sigma^2)$ is the normal-inverse-gamma distribution.

> We have seen this distribution before when we studied the normal model with unknown mean and variance. We know that this distribution can be factored such that:

$$p(\beta_1, \dots, \beta_k, \sigma^2) = p(\beta_1, \dots, \beta_k \mid \sigma^2)\, p(\sigma^2)$$

where $\beta_1, \dots, \beta_k \mid \sigma^2 \sim MN_k(\beta_0, \sigma^2 \Lambda_0)$, with prior mean vector $\beta_0$ and conditional prior variance $\Lambda_0$ (scaled by $\sigma^2$), and $\sigma^2 \sim \text{Inv-Gamma}(a_0, b_0)$.

> If we use a conjugate prior, then the posterior distribution will have the same form as the prior. Thus, the posterior distribution will also follow a normal-inverse-gamma distribution.

> If we integrate out $\sigma^2$, the marginal for $\beta$ will be a multivariate t-distribution.


> Posterior mean:

$$E(\beta \mid y, X) = (\Lambda_0^{-1} + X^T X)^{-1} (\Lambda_0^{-1} \beta_0 + X^T X \hat\beta)$$

> Notice that the coefficients are essentially a weighted average of the prior coefficients described by $\beta_0$ and the standard OLS estimate $\hat\beta$.

> The weights are provided by the conditional prior precision $\Lambda_0^{-1}$ and the data $X^T X$.

> This should make clear that as we increase our prior precision (decrease our prior variance) for $\beta$ we place greater posterior weight on our prior beliefs relative to the data.
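The weighted-average behaviour of the posterior mean can be checked numerically. The R sketch below uses simulated data and a hypothetical prior mean `beta0`; the helper `posterior_mean()` and the two prior variances are assumptions made for the example. It shows the two limiting cases: a very diffuse prior reproduces the OLS estimate, and a very tight prior reproduces the prior mean.

```r
# Simulated data (illustrative assumptions)
set.seed(4)
n <- 20
X <- cbind(1, rnorm(n))
y <- as.vector(X %*% c(0, 0.5) + rnorm(n, sd = 0.5))

XtX <- crossprod(X)
beta_hat <- as.vector(solve(XtX, crossprod(X, y)))   # OLS estimate

beta0 <- c(0, 1)                                     # hypothetical prior mean

# Posterior mean: (Lambda0^{-1} + X'X)^{-1} (Lambda0^{-1} beta0 + X'X beta_hat)
posterior_mean <- function(Lambda0) {
  P0 <- solve(Lambda0)                               # conditional prior precision
  as.vector(solve(P0 + XtX, P0 %*% beta0 + XtX %*% beta_hat))
}

vague <- posterior_mean(diag(1e6, 2))    # large prior variance: data dominate
tight <- posterior_mean(diag(1e-6, 2))   # small prior variance: prior dominates

print(rbind(ols = beta_hat, vague_prior = vague, tight_prior = tight, prior_mean = beta0))
```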


> Note: Zellner (1971) treats $\beta_0$ and the conditional prior variance $\Lambda_0$ in the following way: suppose you have two data sets, $(y_1, X_1)$ and $(y_2, X_2)$. He sets $\beta_0$ equal to the posterior mean for a regression analysis of $(y_1, X_1)$ with the improper prior $1/\sigma^2$, and sets $\Lambda_0$ equal to $(X_1^T X_1)^{-1}$, so that the prior precision $\Lambda_0^{-1}$ equals $X_1^T X_1$.


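> To summarize our uncertainty about the coefficients, the variance-covariance matrix for $\beta$ is given by:

$$\mathrm{Cov}(\beta \mid y, X) = \frac{(\Lambda_0^{-1} + X^T X)^{-1}}{n + a - k - 3} \Big[\, b + (y - X\hat\beta)^T (y - X\hat\beta) + (\beta_0 - E(\beta))^T \Lambda_0^{-1} \beta_0 + (\hat\beta - E(\beta))^T X^T X \hat\beta \,\Big]$$

where $E(\beta)$ denotes the posterior mean $E(\beta \mid y, X)$, and $a$ and $b$ are the hyperparameters of the inverse-gamma prior.

> The posterior standard deviations can be taken from the square roots of the diagonal elements of this matrix.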
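The residual, prior and OLS terms in the covariance expression can be traced to a single algebraic identity. The sketch below assumes the conditional prior $\beta \mid \sigma^2 \sim MN_k(\beta_0, \sigma^2 \Lambda_0)$ and writes $\beta^* = E(\beta \mid y, X)$; the symbol $\beta^*$ is introduced here only for brevity.

```latex
% Posterior mean, as given above
\beta^* = (\Lambda_0^{-1} + X^T X)^{-1} (\Lambda_0^{-1} \beta_0 + X^T X \hat\beta)

% Collecting the terms in the exponent that do not involve beta leaves
y^T y + \beta_0^T \Lambda_0^{-1} \beta_0
      - \beta^{*T} (\Lambda_0^{-1} + X^T X) \beta^*

% Using y^T y = (y - X\hat\beta)^T (y - X\hat\beta) + \hat\beta^T X^T X \hat\beta
% and (\Lambda_0^{-1} + X^T X)\beta^* = \Lambda_0^{-1}\beta_0 + X^T X \hat\beta,
% this equals
(y - X\hat\beta)^T (y - X\hat\beta)
  + (\beta_0 - \beta^*)^T \Lambda_0^{-1} \beta_0
  + (\hat\beta - \beta^*)^T X^T X \hat\beta
```

The first piece is the residual sum of squares, the second grows when the prior mean is far from the posterior mean, and the third grows when the OLS estimate is far from the posterior mean.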


Conjugate Prior

> The second term corresponds to the maximum likelihood estimate of the variance.

> The third term states that our variance estimates will be greater if our prior values for the regression coefficients differ from their posterior values, especially if we indicate a great deal of confidence in our prior beliefs by assigning small variances in the matrix $\Lambda_0$.

> The fourth term states that our variance estimates for the regression coefficients will be greater if the standard OLS estimates differ from the posterior values, especially if $X^T X$ is large.


Application

> DAX and FTSE are two stock market indexes for the German and UK stock exchanges respectively.

> Suppose $P_t^D$ is the DAX value on the $t$-th day

> and $P_t^F$ is the FTSE value on the $t$-th day.

> The corresponding log-returns are

$$r_t^D = \log(P_t^D) - \log(P_{t-1}^D)$$

$$r_t^F = \log(P_t^F) - \log(P_{t-1}^F)$$


> We want to model the relationship as

$$r_t^F = \alpha + \beta r_t^D + \epsilon$$

where $\epsilon \sim N(0, \sigma^2)$.

> $\alpha > 0$ means FTSE is undervalued compared to DAX

> $\alpha < 0$ means FTSE is overvalued compared to DAX

> $\alpha = 0$ means FTSE is fairly valued compared to DAX

> $\beta > 1$ means the systematic risk of FTSE is more than that of DAX

> $\beta < 1$ means the systematic risk of FTSE is less than that of DAX

> $\beta = 1$ means the systematic risk of FTSE is equal to that of DAX

> We consider only the first 20 days of returns, $n = 20$.
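A sketch of how the log-returns and the regression could be set up in R. It assumes that the index series are those in R's built-in `EuStockMarkets` data set, which contains daily closing prices for DAX and FTSE; the lecture does not name its data source, so this is an assumption, and the variable names are illustrative. The least-squares fit gives the classical point estimates of $\alpha$ and $\beta$ that the Bayesian analysis is compared against.

```r
# Assumption: daily closing prices from R's built-in 'EuStockMarkets' data set
prices <- EuStockMarkets[, c("DAX", "FTSE")]

# Log-returns: r_t = log(P_t) - log(P_{t-1})
r <- diff(log(prices))
r_D <- as.numeric(r[, "DAX"])
r_F <- as.numeric(r[, "FTSE"])

# Keep only the first 20 days of returns, n = 20
n <- 20
r_D20 <- r_D[1:n]
r_F20 <- r_F[1:n]

# Least-squares fit of r^F_t = alpha + beta * r^D_t + epsilon
fit <- lm(r_F20 ~ r_D20)
print(coef(fit))     # point estimates of alpha (intercept) and beta (slope)
```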


[Figure: trace plots and posterior density plots of alpha and of beta: DAX.]


Application

Summary

         alpha     beta:DAX
2.5%     0.00040   -0.15973
50%      0.00327    0.38210
97.5%    0.00600    0.94316

> The 95% CI of $\alpha$ indicates that FTSE is undervalued compared to DAX, since the interval is entirely above 0.

> The 95% CI of $\beta$ indicates that the systematic risk of FTSE is less than that of DAX, since the interval does not include 1. More precisely, the posterior probability is estimated from the $N = 1000$ posterior draws $\beta^{(i)}$ as

$$P(\beta < 1 \mid \text{Data}) \approx \frac{1}{N} \sum_{i=1}^{N} I(\beta^{(i)} < 1)$$

    > length(b.star[b.star[,2]<1,2])/1000
    [1] 0.984




Subject: Statistics 


Paper: Statistical Inference 


Module: Bayesian Regression Analysis - Part 2 


Development Team


Principal investigator: Dr. Bhaswati Ganguli, Professor, 
Department of Statistics, University of Calcutta 

Paper co-ordinator: Dr. Dipak K Dey, Associate Dean and BOT 
Distinguished Professor, Department of Statistics, 
University of Connecticut 

Content writer: Dr. Sourish Das, Assistant Professor, Chennai 
Mathematical Institute 


Content reviewer: Department of Statistics, University of Calcutta 


Bayesian Setup

> For the normal linear model, we have:

$$y_i \sim N(\mu_i, \sigma^2)$$

for $i \in \{1, 2, 3, \dots, n\}$, where $\mu_i$ is just shorthand for the expression:

$$\mu_i = \beta_1 X_{1i} + \dots + \beta_k X_{ki}$$

> The object of statistical inference is the posterior distribution of the parameters $\beta_1, \dots, \beta_k$ and $\sigma^2$.

> By Bayes' rule, we know that this is simply:

$$p(\beta_1, \dots, \beta_k, \sigma^2 \mid y, X) \propto \prod_{i=1}^{n} p(y_i \mid \mu_i, \sigma^2)\, p(\beta_1, \dots, \beta_k, \sigma^2)$$


Conjugate priors and the normal linear model

> For the normal regression model, the conjugate prior distribution for $(\beta_1, \dots, \beta_k, \sigma^2)$ is the normal-inverse-gamma distribution.

> We have seen this distribution before when we studied the normal model with unknown mean and variance. We know that this distribution can be factored such that:

$$p(\beta_1, \dots, \beta_k, \sigma^2) = p(\beta_1, \dots, \beta_k \mid \sigma^2)\, p(\sigma^2)$$

where $\beta_1, \dots, \beta_k \mid \sigma^2 \sim MN_k(\beta_0, \sigma^2 \Lambda_0)$, with prior mean vector $\beta_0$ and conditional prior variance $\Lambda_0$ (scaled by $\sigma^2$), and $\sigma^2 \sim \text{Inv-Gamma}(a_0, b_0)$.

> If we use a conjugate prior, then the posterior distribution will have the same form as the prior. Thus, the posterior distribution will also follow a normal-inverse-gamma distribution.

> If we integrate out $\sigma^2$, the marginal for $\beta$ will be a multivariate t-distribution.

> Posterior mean:

$$E(\beta \mid y, X) = (\Lambda_0^{-1} + X^T X)^{-1} (\Lambda_0^{-1} \beta_0 + X^T X \hat\beta)$$

> Notice that the coefficients are essentially a weighted average of the prior coefficients described by $\beta_0$ and the standard OLS estimate $\hat\beta$.

> The weights are provided by the conditional prior precision $\Lambda_0^{-1}$ and the data $X^T X$.

> This should make clear that as we increase our prior precision (decrease our prior variance) for $\beta$ we place greater posterior weight on our prior beliefs relative to the data.

> Note: Zellner (1971) treats $\beta_0$ and the conditional prior variance $\Lambda_0$ in the following way: suppose you have two data sets, $(y_1, X_1)$ and $(y_2, X_2)$. He sets $\beta_0$ equal to the posterior mean for a regression analysis of $(y_1, X_1)$ with the improper prior $1/\sigma^2$, and sets $\Lambda_0$ equal to $(X_1^T X_1)^{-1}$, so that the prior precision $\Lambda_0^{-1}$ equals $X_1^T X_1$.

> To summarize our uncertainty about the coefficients, the variance-covariance matrix for $\beta$ is given by:

$$\mathrm{Cov}(\beta \mid y, X) = \frac{(\Lambda_0^{-1} + X^T X)^{-1}}{n + a - k - 3} \Big[\, b + (y - X\hat\beta)^T (y - X\hat\beta) + (\beta_0 - E(\beta))^T \Lambda_0^{-1} \beta_0 + (\hat\beta - E(\beta))^T X^T X \hat\beta \,\Big]$$

where $E(\beta)$ denotes the posterior mean $E(\beta \mid y, X)$, and $a$ and $b$ are the hyperparameters of the inverse-gamma prior.

> The posterior standard deviations can be taken from the square roots of the diagonal elements of this matrix.


Is there a fairer test?

> With the small sample size, it is difficult to say conclusively that the parties are not following their activists.

> We can use the Bayesian method and incorporate prior information about the data-generating process into the model.


NU . . Ak athshala 
Justification for the Bayesian Approach ) STS SITeIT 


aiia dh 


» Wallerstein and Stephens reach an empirical impass, where 
they do not able to adjudicate between the two theories, 
because of the small sample size and multicollinear variables. 


10 / 23 


MHRD 
J]IN (Govt. of India 
EG) 


-—- @Qathshala 
Justification for the Bayesian Approach HUTSA d 
> Wallerstein and Stephens reach an empirical impass, where 

they do not able to adjudicate between the two theories, 
because of the small sample size and multicollinear variables. 


> The incorporation of prior information provides additional 
structure to the data, which helps to uniquely identify the two 
coefficients. 


10 / 23 


MHRD 


Iii Govt. of India 


@Qethshala 
Justification for the Bayesian Approach © QUST 


> Wallerstein and Stephens reach an empirical impass, where 
they do not able to adjudicate between the two theories, 
because of the small sample size and multicollinear variables. 


> The incorporation of prior information provides additional 
structure to the data, which helps to uniquely identify the two 
coefficients. 


> Priors can be developed as equivalent to prior data sets, 
inflating the de facto n. 


10 / 23 


Justification for the Bayesian Approach

> Wallerstein and Stephens reach an empirical impasse: they are
not able to adjudicate between the two theories because of the
small sample size and multicollinear variables.

> The incorporation of prior information provides additional
structure to the data, which helps to uniquely identify the two
coefficients.

> Priors can be developed as equivalent to prior data sets,
inflating the de facto n.

> The data set contains all available observations from a
population of interest; it is not a random sample. More
generally, cross-national data sets are not generated by a
"repeatable" data-generating process.


> Frequentist inference about a statistic (e.g. a regression
coefficient) is obtained through the assumption that the process
generating the data could be repeated a large number of times.

> Specifically, frequentist inference is about the proportion of
the time that, in the long run, realizations of this statistic will
fall within some interval.

> If there is no long run, or no possibility of repetition, then
such probabilistic summaries are not appropriate.


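Probit Regression

> Suppose we have the data set $D = \{(y_i, x_i) \mid i = 1, 2, \ldots, n\}$, where
$x_i$ is a $k$-variate vector of predictors.

The response variable $y$ is binary:

\[ y_i = \begin{cases} 1 & \text{Success} \\ 0 & \text{Failure} \end{cases} \qquad \text{where } i = 1, 2, \ldots, n \]

> The binary response model:

\[ P(y_i = 1 \mid x_i, \beta) = \Phi(x_i^T \beta) \]
\[ P(y_i = 0 \mid x_i, \beta) = \Phi(-x_i^T \beta) \]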
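As a small illustration of the binary response model, the sketch below evaluates both probabilities with R's standard normal CDF `pnorm`. The design matrix `X`, the coefficient vector `beta`, and the helper name `probit_prob` are made-up values for illustration only, not part of the module.

```r
# Probit success probabilities P(y_i = 1 | x_i, beta) = Phi(x_i' beta).
# X and beta are hypothetical illustrative values.
probit_prob <- function(X, beta) {
  eta <- as.vector(X %*% beta)   # linear predictor x_i' beta
  pnorm(eta)                     # standard normal CDF
}

X    <- cbind(1, c(-1, 0, 2))    # intercept plus one predictor
beta <- c(0.5, 1)

p1 <- probit_prob(X, beta)               # P(y = 1)
p0 <- pnorm(-as.vector(X %*% beta))      # P(y = 0) = Phi(-x' beta)
print(cbind(p1, p0, total = p1 + p0))    # the two probabilities sum to one
```

The two probabilities sum to one because the standard normal density is symmetric about zero.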


Probit Regression: Data Augmentation

> Data augmentation is a very powerful technique.

> We rewrite the probit regression model in the following way:

\[ z_i = x_i^T \beta + e_i, \qquad i = 1, 2, \ldots, n \]

where $e_i \sim N(0, 1)$.

> Now we can say the observed data are generated as

\[ y_i = \begin{cases} 1 & z_i > 0 \\ 0 & z_i \le 0 \end{cases} \]
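The latent-variable form can be checked by simulation. The sketch below, with a hypothetical coefficient and a single fixed covariate value, generates latent $z_i$, thresholds them at zero, and compares the observed share of successes with $\Phi(x\beta)$.

```r
set.seed(1)
n    <- 20000
x    <- rep(0.3, n)            # one fixed covariate value (hypothetical)
beta <- 1.2                    # hypothetical coefficient

z <- x * beta + rnorm(n)       # latent z_i = x_i * beta + e_i, e_i ~ N(0, 1)
y <- as.integer(z > 0)         # observed y_i = 1 if z_i > 0, else 0

sim_share <- mean(y)           # simulated P(y = 1)
theory    <- pnorm(0.3 * beta) # Phi(x * beta) from the probit model
print(c(simulated = sim_share, theoretical = theory))
```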


> Prior:

\[ \beta \sim N(\beta_0, A_0) \]


> Conditional posterior distribution:

\[ p(\beta \mid z, y, X) = N(\beta \mid E(\beta \mid z, X), A_1^{-1}) \]

where

\[ E(\beta \mid z, X) = (A_0^{-1} + X^T X)^{-1} (A_0^{-1} \beta_0 + X^T X \hat{\beta}) \]

with $\hat{\beta} = (X^T X)^{-1} X^T z$ and $A_1 = A_0^{-1} + X^T X$.
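A minimal sketch of this conditional posterior in R is given here. The function name `beta_conditional` and the simulated inputs are assumptions for illustration. It uses the identity $X^T X \hat{\beta} = X^T z$, so the least-squares estimate never has to be formed explicitly.

```r
# Conditional posterior of beta given the latent z (mean and covariance).
# beta0 and A0 are the prior mean and prior covariance.
beta_conditional <- function(z, X, beta0, A0) {
  A0_inv <- solve(A0)
  A1     <- A0_inv + crossprod(X)                 # posterior precision A_1
  # X'X beta_hat equals X'z, so crossprod(X, z) replaces it
  post_mean <- solve(A1, A0_inv %*% beta0 + crossprod(X, z))
  list(mean = as.vector(post_mean), cov = solve(A1))
}

# Illustration on simulated inputs (hypothetical values)
set.seed(1)
X   <- cbind(1, rnorm(50))
z   <- rnorm(50)
res <- beta_conditional(z, X, beta0 = c(0, 0), A0 = diag(1e8, 2))
ols <- as.vector(solve(crossprod(X), crossprod(X, z)))
print(rbind(posterior_mean = res$mean, least_squares = ols))
```

With a very diffuse prior, as in this usage example, the conditional posterior mean is essentially the least-squares fit of z on X.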


> Data generating process is:

\[ z_i = x_i^T \beta + e_i, \qquad e_i \sim N(0, 1) \]

\[ y_i = \begin{cases} 1 & \text{if } z_i > 0 \\ 0 & \text{otherwise} \end{cases} \]

> Prior: $\beta \sim N(\beta_0, A_0)$


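Probit Regression: Gibbs Sampler

1. Choose an initial value $\beta^{(0)}$.
2. For $m = 1, 2, \ldots$
   Sample $z^{(m)}$ as

   \[ z_i^{(m)} \sim \begin{cases} TN_{(0,\infty)}(z_i \mid x_i^T \beta^{(m-1)}, 1) & \text{if } y_i = 1 \\ TN_{(-\infty,0]}(z_i \mid x_i^T \beta^{(m-1)}, 1) & \text{if } y_i = 0 \end{cases} \]

   Sample $\beta^{(m)}$ as

   \[ \beta^{(m)} \sim N(\beta \mid E(\beta \mid z^{(m)}, X), A_1^{-1}) \]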
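The two sampling steps can be sketched in base R as follows. This is an illustrative implementation under stated assumptions, not the code used for the module: the helper names `rtruncnorm1` and `probit_gibbs` are hypothetical, the truncated normal is drawn by the inverse-CDF method (which can lose accuracy when the truncation region lies far in the tail), and the data in the usage example are simulated.

```r
# Draw from N(mu, 1) truncated to (lower, upper) by the inverse-CDF method
rtruncnorm1 <- function(mu, lower, upper) {
  u <- runif(length(mu), pnorm(lower - mu), pnorm(upper - mu))
  mu + qnorm(u)
}

probit_gibbs <- function(y, X, beta0, A0, M = 2000) {
  k      <- ncol(X)
  A0_inv <- solve(A0)
  A1_inv <- solve(A0_inv + crossprod(X))   # posterior covariance A_1^{-1}
  L      <- t(chol(A1_inv))                # lower Cholesky factor, L L' = A_1^{-1}
  beta   <- rep(0, k)                      # initial value beta^(0)
  draws  <- matrix(NA_real_, M, k)
  for (m in seq_len(M)) {
    mu <- as.vector(X %*% beta)
    # z_i > 0 when y_i = 1, z_i <= 0 when y_i = 0
    z <- rtruncnorm1(mu,
                     lower = ifelse(y == 1, 0, -Inf),
                     upper = ifelse(y == 1, Inf, 0))
    post_mean <- A1_inv %*% (A0_inv %*% beta0 + crossprod(X, z))
    beta <- as.vector(post_mean + L %*% rnorm(k))   # draw from N(post_mean, A_1^{-1})
    draws[m, ] <- beta
  }
  draws
}

# Usage on simulated data with known coefficients (hypothetical values)
set.seed(2024)
n <- 500
X <- cbind(1, rnorm(n))
beta_true <- c(-0.5, 1)
y <- as.integer(X %*% beta_true + rnorm(n) > 0)
draws <- probit_gibbs(y, X, beta0 = c(0, 0), A0 = diag(100, 2), M = 1500)
post_means <- colMeans(draws[-(1:500), ])           # discard 500 burn-in draws
print(post_means)
```

Because the posterior covariance $A_1^{-1}$ does not depend on $z$, it is computed once outside the loop.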


Application

> Consider the 'birthwt' dataset available in the MASS package of R.

> The dataset is used to look for the risk factors associated with
low infant birth weight.

\[ \text{low}_i = \begin{cases} 1 & \text{birth weight less than 2.5 kg} \\ 0 & \text{otherwise} \end{cases} \qquad i = 1, 2, \ldots, 189 \]

\[ z_i = \beta_0 + \beta_1 \text{Age}_i + \beta_2 I(\text{race}_i = \text{black}) + \beta_3 I(\text{race}_i = \text{others}) + \beta_4 I(\text{Smoke}_i = \text{yes}) + e_i \]

where $e_i \sim N(0, 1)$ and

\[ P(\text{low} = 1 \mid \text{Age}, \text{race}, \text{smoke}) = P(z > 0 \mid \text{Age}, \text{race}, \text{smoke}) \]
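A sketch of how the response and design matrix for this model might be built from `MASS::birthwt` is shown here. The column names chosen for `X` are illustrative. In `birthwt`, `race` is coded 1 = white, 2 = black, 3 = other, and `smoke` is a 0/1 indicator.

```r
library(MASS)   # provides the birthwt data set

data(birthwt)

y <- birthwt$low                                   # 1 if birth weight < 2.5 kg
X <- cbind(Intercept = 1,
           Age   = birthwt$age,
           Black = as.numeric(birthwt$race == 2),  # indicator: race = black
           Other = as.numeric(birthwt$race == 3),  # indicator: race = other
           Smoke = birthwt$smoke)                  # indicator: smoked during pregnancy

print(dim(X))     # number of observations and number of coefficients
print(table(y))   # counts of normal and low birth weight
```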


Posterior Summary

             mean     sd    2.5%   97.5%
Intercept  -0.057  0.263  -0.579   0.456
Black       0.246  0.213  -0.171   0.663
Other       0.299  0.177  -0.051   0.648
Age        -0.032  0.012  -0.055  -0.008
Smoke       0.380  0.170   0.044   0.711

> The 95% CIs of the coefficients for race (black and others with
respect to white) include 0. Therefore we cannot conclude that
race has an effect on low birth weight of the newborn.

> The 95% CI of the coefficient for age does not include 0. The
negative coefficient indicates that for younger mothers the
probability of low birth weight is higher.

> The 95% CI of the coefficient for smoking status does not
include 0. It indicates that if a mother smokes during
pregnancy, then the probability of low birth weight of her child is
significantly higher.
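A summary table of this form can be computed from a matrix of posterior draws with one column per coefficient. The sketch below uses a hypothetical helper `posterior_summary` and, so that it runs on its own, placeholder draws generated with `rnorm`. These placeholder draws are not the module's MCMC output.

```r
# Posterior mean, sd and 95% interval for each column of a draws matrix
posterior_summary <- function(draws) {
  t(apply(draws, 2, function(d) {
    c(mean = mean(d), sd = sd(d), quantile(d, c(0.025, 0.975)))
  }))
}

# Placeholder draws for illustration only (NOT real sampler output)
set.seed(42)
fake_draws <- cbind(Age   = rnorm(5000, -0.03, 0.01),
                    Smoke = rnorm(5000,  0.40, 0.20))
s <- posterior_summary(fake_draws)
print(round(s, 3))
```

An interval that excludes 0, such as the one for the placeholder `Age` column here, is read in the same way as the intervals discussed above.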


[Figure: MCMC trace plots (sampled value against iteration, 0 to 8000) and posterior density plots for the Intercept, Age, and Smoke coefficients.]


A Gateway to all Postgraduate Courses An MHRD project under its National Mission on Education through ICT (NME-ICT) 


Subject: Statistics 


Paper: Statistical Inference 
Module: More on MCMC 


Development Team


Principal investigator: Dr. Bhaswati Ganguli, Professor, 
Department of Statistics, University of Calcutta 

Paper co-ordinator: Dr. Dipak K Dey, Associate Dean and BOT 
Distinguished Professor, Department of Statistics, 
University of Connecticut 

Content writer: Dr. Sourish Das, Assistant Professor, Chennai 
Mathematical Institute 


Content reviewer: Department of Statistics, University of Calcutta 


Outline 


1. Gibbs Sampling 


2. Metropolis Hastings Algorithms 


3. MCMC Diagnostics 


What is Gibbs Sampling?

> In the last module we saw Gibbs sampling introduced for the
probit regression model, where its implementation is really simple.

> Suppose we have a joint posterior distribution
$p(\theta_1, \ldots, \theta_k \mid \text{data})$ that we want to sample from.

> We can use the Gibbs sampler to sample from the target
distribution if we know the full conditional posterior
distributions for each parameter.

> For each parameter, the full conditional posterior distribution
is the distribution of the parameter conditional on the known
information and all the other parameters: $p(\theta_j \mid \theta_{-j}, y)$


The Hammersley-Clifford Theorem

> For two blocks: suppose we have a joint density $f(x, y)$. The
theorem proves that we can write out the joint density in
terms of the conditional densities $f(x \mid y)$ and $f(y \mid x)$:

\[ f(x, y) = \frac{f(y \mid x)}{\int \frac{f(y \mid x)}{f(x \mid y)} \, dy} \]

> We can write the denominator as

\[ \int \frac{f(y \mid x)}{f(x \mid y)} \, dy = \int \frac{f(x, y) / f(x)}{f(x, y) / f(y)} \, dy = \int \frac{f(y)}{f(x)} \, dy = \frac{1}{f(x)} \]

> Thus, our right-hand side is

\[ \frac{f(y \mid x)}{1 / f(x)} = f(y \mid x) f(x) = f(x, y) \]

> The theorem shows that knowledge of the conditional
densities allows us to get the joint density.

> This works for more than two blocks of parameters.

> But how do we figure out the full conditionals?
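The two-block identity can be checked numerically. The sketch below assumes, purely for illustration, a standard bivariate normal with correlation `rho`, whose conditionals are $x \mid y \sim N(\rho y, 1 - \rho^2)$ and $y \mid x \sim N(\rho x, 1 - \rho^2)$. It rebuilds the joint density from the two conditionals alone and compares it with the joint computed directly.

```r
rho <- 0.6                   # hypothetical correlation
s   <- sqrt(1 - rho^2)       # conditional standard deviation

# Log conditional densities of the standard bivariate normal
log_f_x_given_y <- function(x, y) dnorm(x, rho * y, s, log = TRUE)
log_f_y_given_x <- function(y, x) dnorm(y, rho * x, s, log = TRUE)

joint_from_conditionals <- function(x, y) {
  # Denominator: integral over yy of f(yy | x) / f(x | yy), formed on the log scale
  ratio <- function(yy) exp(log_f_y_given_x(yy, x) - log_f_x_given_y(x, yy))
  denom <- integrate(ratio, -Inf, Inf)$value
  exp(log_f_y_given_x(y, x)) / denom
}

# Direct joint density f(x, y) = f(x) f(y | x), for comparison
joint_direct <- function(x, y) dnorm(x) * dnorm(y, rho * x, s)

a <- joint_from_conditionals(0.5, -1)
b <- joint_direct(0.5, -1)
print(c(reconstructed = a, direct = b))
```

Working on the log scale avoids 0/0 in the tails, where both conditional densities underflow.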


Steps to Calculating Full Conditional Distributions

> Target posterior density $p(\theta \mid y)$

> To calculate the full conditionals for each $\theta$:
1. Consider the full posterior, ignoring the constant of proportionality.
2. Split the parameter vector into two blocks, say $\theta = (\theta_1, \theta_2)$.
3. Consider a block of parameters (say, $\theta_1$) and ignore everything
   that doesn't depend on $\theta_1$.
4. Use the knowledge of distributions to figure out what the
   normalizing constant is (and thus what the full conditional
   distribution $p(\theta_1 \mid \theta_{-1}, y)$ is).
5. Repeat the steps for the other blocks.

You can generalize this to k blocks.


Gibbs Sampler Steps

> Suppose that we are interested in sampling from the posterior
$p(\theta \mid y)$, where $\theta$ is a vector of three parameters, $\theta_1, \theta_2, \theta_3$.

> The steps to a Gibbs sampler are
1. Pick a vector of starting values $\theta^{(0)}$. (Defining a starting
   distribution $\Pi^{(0)}$ and drawing $\theta^{(0)}$ from it.)
2. Draw a value $\theta_1^{(1)}$ from the full conditional $p(\theta_1 \mid \theta_2^{(0)}, \theta_3^{(0)}, y)$.
3. Draw a value $\theta_2^{(1)}$ from the full conditional
   $p(\theta_2 \mid \theta_1^{(1)}, \theta_3^{(0)}, y)$. (Note that we must use the updated value of
   $\theta_1^{(1)}$.)
4. Draw a value $\theta_3^{(1)}$ from the full conditional $p(\theta_3 \mid \theta_1^{(1)}, \theta_2^{(1)}, y)$.


> Steps 2-4 are analogous to multiplying $\Pi^{(0)}$ and $P$ to get $\Pi^{(1)}$
and then drawing $\theta^{(1)}$ from $\Pi^{(1)}$.
5. Draw $\theta^{(2)}$ using $\theta^{(1)}$, continually using the most updated values.
6. Repeat until you have M draws, with each draw being a vector
   of $\theta^{(t)}$ values.
7. Optional burn-in and/or thinning.

> The 'Gibbs sample' is a Markov chain with a bunch of draws of $\theta$
that are approximately from our posterior.
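The update pattern in the steps above can be sketched on a toy target. For brevity the example swaps in a two-parameter target instead of three: a standard bivariate normal with correlation `rho`, whose full conditionals are $\theta_1 \mid \theta_2 \sim N(\rho \theta_2, 1 - \rho^2)$ and symmetrically for $\theta_2$. The function name `gibbs_bvn`, the starting values, the burn-in length and the thinning interval are all illustrative assumptions.

```r
gibbs_bvn <- function(M, rho, start = c(0, 0)) {
  draws <- matrix(NA_real_, M, 2,
                  dimnames = list(NULL, c("theta1", "theta2")))
  theta <- start                       # starting values theta^(0)
  s <- sqrt(1 - rho^2)                 # conditional standard deviation
  for (m in seq_len(M)) {
    theta[1] <- rnorm(1, rho * theta[2], s)  # uses the current theta2
    theta[2] <- rnorm(1, rho * theta[1], s)  # uses the freshly updated theta1
    draws[m, ] <- theta
  }
  draws
}

set.seed(123)
draws <- gibbs_bvn(6000, rho = 0.8, start = c(5, 5))   # deliberately poor start
kept  <- draws[seq(1001, 6000, by = 5), ]              # burn-in 1000, keep every 5th
est_cor <- cor(kept[, 1], kept[, 2])
print(colMeans(kept))
print(est_cor)
```

The deliberately poor starting point shows why early draws are usually discarded as burn-in: the chain needs some iterations to move into the region where the target has most of its mass.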


Application from Robert and Casella (2004)

> Consider the Poisson-Gamma hierarchical model discussed in
Module 7 on hierarchical models.

> Suppose we have data on the number of failures ($y_i$) for each
of 10 pumps in a nuclear plant.

> We also have the times ($t_i$) over which each pump was
observed.

    > y <- c(5, 1, 5, 14, 3, 19, 1, 1, 4, 22)
    > t <- c(94, 16, 63, 126, 5, 31, 1, 1, 2, 10)
    > rbind(y, t)
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
    y    5    1    5   14    3   19    1    1    4    22
    t   94   16   63  126    5   31    1    1    2    10


Application: Gibbs Sampler for Poisson Gamma Models

> Our objective is to model the number of failures with a
Poisson model, where the expected number of failures per unit
time, $\lambda_i$, differs for each pump.

> Since the observed time for each pump is different, we need
to scale each $\lambda_i$ by its observed time $t_i$.

> Poisson likelihood:

\[ \prod_{i=1}^{10} \text{Poisson}(\lambda_i t_i) \]

> Consider a Gamma($\alpha$, $\beta$) prior on $\lambda_i$ with $\alpha = 2$, so the $\lambda_i$'s are
drawn from the same distribution.


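> Suppose a Gamma($\gamma$, $\delta$) prior on $\beta$ with $\gamma = 0.01$ and $\delta = 1$.

> So our model has 11 parameters that are unknown (10 $\lambda_i$'s
and $\beta$).

> Full posterior is

\[ p(\lambda, \beta \mid y, t) \propto \left( \prod_{i=1}^{10} \text{Poisson}(\lambda_i t_i) \times \text{Gamma}(\alpha, \beta) \right) \times \text{Gamma}(\gamma, \delta) \]

\[ = \left( \prod_{i=1}^{10} \frac{e^{-\lambda_i t_i} (\lambda_i t_i)^{y_i}}{y_i!} \times \frac{\beta^{\alpha}}{\Gamma(\alpha)} \lambda_i^{\alpha - 1} e^{-\beta \lambda_i} \right) \times \frac{\delta^{\gamma}}{\Gamma(\gamma)} \beta^{\gamma - 1} e^{-\delta \beta} \]

\[ \propto \left( \prod_{i=1}^{10} e^{-\lambda_i t_i} (\lambda_i t_i)^{y_i} \times \beta^{\alpha} \lambda_i^{\alpha - 1} e^{-\beta \lambda_i} \right) \times \beta^{\gamma - 1} e^{-\delta \beta} \]

\[ \propto \left( \prod_{i=1}^{10} \lambda_i^{y_i + \alpha - 1} e^{-(t_i + \beta) \lambda_i} \right) \beta^{10\alpha + \gamma - 1} e^{-\delta \beta} \]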


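> Full conditional posterior distributions:
1. $p(\lambda_i \mid \lambda_{-i}, \beta, y, t) \propto \lambda_i^{y_i + \alpha - 1} e^{-(t_i + \beta) \lambda_i}$

   $p(\lambda_i \mid \lambda_{-i}, \beta, y, t)$ is Gamma($y_i + \alpha$, $t_i + \beta$)
2. $p(\beta \mid \lambda, y, t) \propto e^{-\beta (\delta + \sum_{i=1}^{10} \lambda_i)} \beta^{10\alpha + \gamma - 1}$

   $p(\beta \mid \lambda, y, t)$ is Gamma($10\alpha + \gamma$, $\delta + \sum_{i=1}^{10} \lambda_i$)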


Coding the Gibbs Sampler

1. Define a starting value $\beta^{(0)}$.

    > beta.cur <- 1

2. Draw $\lambda^{(1)}$ from its full conditional (we draw all the $\lambda_i$'s as a
   block because they depend only on $\beta$ and not on each other).

    > lambda.update <- function(alpha, beta, y, t) {
    +   rgamma(length(y), y + alpha, t + beta)
    + }

3. Draw $\beta^{(1)}$ from its full conditional, using $\lambda^{(1)}$.

    > beta.update <- function(alpha, gamma, delta, lambda, y) {
    +   rgamma(1, length(y) * alpha + gamma, delta + sum(lambda))
    + }


4. Repeat using the most updated values until we get M draws.
5. Optional burn-in and thinning.
6. Write it as a function so that you can use it repeatedly.


7. Do Monte Carlo integration on the resulting Markov chain,
   whose draws are approximately samples from the posterior.

    > round(colMeans(posterior$lambda.draws), digits = 2)
    [1] 0.07 0.16 0.11 0.12 0.65 0.62 0.85 0.86 1.31 1.90
    > round(mean(posterior$beta.draws), digits = 2)
    [1] 2.68
    > round(apply(posterior$lambda.draws, 2, sd), digits = 2)
    [1] 0.03 0.09 0.04 0.03 0.30 0.14 0.54 0.54 0.58 0.40
    > round(sd(posterior$beta.draws), digits = 2)
    [1] 0.75
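One way steps 1 to 7 might be wrapped into a single function is sketched here. The slides do not show the wrapper itself, so the function name `gibbs`, its argument names, the number of simulations and the burn-in length are assumptions. Only the two update functions, the data, and the element names `lambda.draws` and `beta.draws` come from the module.

```r
y <- c(5, 1, 5, 14, 3, 19, 1, 1, 4, 22)        # failures per pump
t <- c(94, 16, 63, 126, 5, 31, 1, 1, 2, 10)    # observation time per pump

lambda.update <- function(alpha, beta, y, t) {
  rgamma(length(y), y + alpha, t + beta)       # Gamma(y_i + alpha, t_i + beta)
}
beta.update <- function(alpha, gamma, delta, lambda, y) {
  rgamma(1, length(y) * alpha + gamma, delta + sum(lambda))
}

# Hypothetical wrapper: runs the chain and applies burn-in and thinning
gibbs <- function(n.sims, beta.start, alpha, gamma, delta, y, t,
                  burnin = 0, thin = 1) {
  beta.draws   <- numeric(n.sims)
  lambda.draws <- matrix(NA_real_, nrow = n.sims, ncol = length(y))
  beta.cur <- beta.start
  for (i in seq_len(n.sims)) {
    lambda.cur <- lambda.update(alpha, beta.cur, y, t)           # uses current beta
    beta.cur   <- beta.update(alpha, gamma, delta, lambda.cur, y) # uses new lambda
    lambda.draws[i, ] <- lambda.cur
    beta.draws[i]     <- beta.cur
  }
  keep <- seq(burnin + 1, n.sims, by = thin)
  list(lambda.draws = lambda.draws[keep, , drop = FALSE],
       beta.draws   = beta.draws[keep])
}

set.seed(1)
posterior <- gibbs(n.sims = 10000, beta.start = 1, alpha = 2,
                   gamma = 0.01, delta = 1, y = y, t = t, burnin = 1000)
print(round(colMeans(posterior$lambda.draws), digits = 2))
print(round(mean(posterior$beta.draws), digits = 2))
```

Exact values depend on the seed and the number of draws, so they will differ slightly from run to run.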


The Metropolis-Hastings Algorithm

> Suppose we have a posterior $p(\theta \mid y)$ that we want to sample
from, but
  > the posterior does not look like any distribution that we
  know;
  > the posterior consists of more than 2 parameters, so grid
  approximation is not feasible;
  > some (or all) of the full conditionals do not follow known
  distributions, so Gibbs sampling is not possible for
  those parameters whose full conditionals are not known.

> In such cases, we can use the Metropolis-Hastings algorithm.


The Metropolis-Hastings algorithm:

1. Choose a starting value $\theta^{(0)}$.
2. At iteration t, draw a candidate $\theta^*$ from a proposal
   distribution $q_t(\theta^* \mid \theta^{(t-1)})$.
3. Compute the acceptance ratio

   \[ r = \frac{p(\theta^* \mid y) / q_t(\theta^* \mid \theta^{(t-1)})}{p(\theta^{(t-1)} \mid y) / q_t(\theta^{(t-1)} \mid \theta^*)} \]

4. Accept $\theta^{(t)} = \theta^*$ if $U < \min(1, r)$, where $U \sim \text{Uniform}(0, 1)$.
   Otherwise $\theta^{(t)} = \theta^{(t-1)}$.
5. Repeat steps 2-4 M times to get M draws from $p(\theta \mid y)$, with
   optional burn-in and/or thinning.
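The five steps can be sketched in R for a scalar parameter as follows. Everything in the example is an illustrative assumption: the function name `metropolis_hastings`, the target (an unnormalised Gamma(3, 2) density standing in for a posterior), and the multiplicative log-normal random-walk proposal, chosen because it is not symmetric and so needs the full ratio from step 3.

```r
# log_post(th): log target density up to a constant
# r_q(cur): draws a candidate given the current value
# log_q(a, b): log proposal density of a given b
metropolis_hastings <- function(log_post, r_q, log_q, theta0, M) {
  draws <- numeric(M)
  cur <- theta0                                 # step 1: starting value
  n_accept <- 0
  for (m in seq_len(M)) {
    cand <- r_q(cur)                            # step 2: candidate
    # step 3: log acceptance ratio, including the proposal correction
    log_r <- (log_post(cand) - log_q(cand, cur)) -
             (log_post(cur)  - log_q(cur, cand))
    if (log(runif(1)) < min(0, log_r)) {        # step 4: accept or reject
      cur <- cand
      n_accept <- n_accept + 1
    }
    draws[m] <- cur
  }
  list(draws = draws, accept.rate = n_accept / M)
}

# Illustrative target: Gamma(shape = 3, rate = 2), known only up to a constant
log_post <- function(th) if (th <= 0) -Inf else 2 * log(th) - 2 * th
r_q      <- function(cur) cur * exp(rnorm(1, 0, 0.8))       # log-normal random walk
log_q    <- function(a, b) dlnorm(a, log(b), 0.8, log = TRUE)

set.seed(10)
out <- metropolis_hastings(log_post, r_q, log_q, theta0 = 1, M = 20000)
post_mean <- mean(out$draws[-(1:1000)])         # discard 1000 burn-in draws
print(c(posterior_mean = post_mean, accept_rate = out$accept.rate))
```

The mean of a Gamma(3, 2) distribution is 1.5, which gives a reference value for the estimate.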


Step 1: Choose a starting value $\theta^{(0)}$

> This is similar to drawing a sample from the initial
distribution of the Markov chain.

> Note that $\theta^{(0)}$ must have positive posterior density:

\[ p(\theta^{(0)} \mid y) > 0 \]


Step 2: Draw $\theta^*$ from $q_t(\theta^* \mid \theta^{(t-1)})$

> The proposal distribution $q_t(\theta^* \mid \theta^{(t-1)})$ determines where the
chain moves in the next iteration of the Markov chain.

> The proposal distribution is analogous to a transition kernel.

> The support of the proposal density must contain the support of
the posterior.

> If the proposal distribution is symmetric and depends on
$\theta^{(t-1)}$, then the algorithm is known as random walk Metropolis sampling.


> If the proposal distribution does not depend on $\theta^{(t-1)}$,

\[ q_t(\theta^* \mid \theta^{(t-1)}) = q_t(\theta^*) \]

then it is known as independent Metropolis-Hastings sampling.

> In this case all candidate draws $\theta^*$ are drawn from the same
distribution, regardless of where the previous draw was.

> This can be very efficient or very inefficient, depending on how
close the proposal density is to the posterior density.

> Typically, if the proposal density has heavier tails than the
posterior, then the chain will behave nicely.
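A minimal sketch of independent Metropolis-Hastings sampling with a heavy-tailed proposal is given here. The target (a standard normal known only up to a constant), the Student t proposal with 3 degrees of freedom, and the function name `independence_mh` are assumptions made for illustration.

```r
# Independence sampler: candidates never depend on the current value
independence_mh <- function(log_post, r_q, log_q, theta0, M) {
  draws <- numeric(M)
  cur <- theta0
  log_w_cur <- log_post(cur) - log_q(cur)      # log of p(theta | y) / q(theta)
  n_accept <- 0
  for (m in seq_len(M)) {
    cand <- r_q()                              # same proposal at every iteration
    log_w_cand <- log_post(cand) - log_q(cand)
    # r = [p(cand) / q(cand)] / [p(cur) / q(cur)]
    if (log(runif(1)) < log_w_cand - log_w_cur) {
      cur <- cand
      log_w_cur <- log_w_cand
      n_accept <- n_accept + 1
    }
    draws[m] <- cur
  }
  list(draws = draws, accept.rate = n_accept / M)
}

log_post <- function(th) -0.5 * th^2               # N(0, 1) up to a constant
r_q      <- function() rt(1, df = 3)               # heavier tails than the target
log_q    <- function(th) dt(th, df = 3, log = TRUE)

set.seed(7)
out <- independence_mh(log_post, r_q, log_q, theta0 = 0, M = 20000)
print(c(accept_rate = out$accept.rate, sd = sd(out$draws)))
```

Because the t proposal has heavier tails than the normal target, the ratio of target to proposal density stays bounded, which is what keeps the chain from getting stuck in the tails.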


e Jathshala 
oS UTOSITeIT 


t. of India 


Step 3: Compute the acceptance ratio r.

$$r = \frac{p(\theta^* \mid y) \,/\, q_t(\theta^* \mid \theta^{(t-1)})}{p(\theta^{(t-1)} \mid y) \,/\, q_t(\theta^{(t-1)} \mid \theta^*)}$$

> In the case where our proposal density is symmetric,

$$r = \frac{p(\theta^* \mid y)}{p(\theta^{(t-1)} \mid y)}$$

> If the candidate sample has higher probability than the current sample, then the candidate is better, so we definitely accept it.

> Otherwise, the candidate is accepted according to the ratio of the probabilities of the candidate and current samples.

> Since r is a ratio, we only need to know $p(\theta \mid y)$ up to a constant.
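
The acceptance ratio is usually computed on the log scale to avoid numerical underflow. The sketch below is an illustration only: the helper name `log_acceptance_ratio`, the unnormalised standard-normal "posterior" and the symmetric proposal are assumptions made for the example, not something given in the slides.

```python
import math

def log_acceptance_ratio(log_post, log_q, candidate, current):
    """log r for Metropolis-Hastings.

    log_post(theta): log posterior, known only up to an additive constant.
    log_q(to, frm):  log proposal density of moving to `to` from `frm`.
    """
    return (log_post(candidate) - log_q(candidate, current)) - (
        log_post(current) - log_q(current, candidate)
    )

# Assumed example: unnormalised N(0, 1) posterior and a symmetric proposal.
log_post = lambda th: -0.5 * th * th
log_q_symmetric = lambda to, frm: -0.5 * (to - frm) ** 2

r = math.exp(log_acceptance_ratio(log_post, log_q_symmetric, 0.5, 1.0))
print(r)  # larger than 1 because the candidate 0.5 is closer to the mode than 1.0
```

Because the proposal is symmetric, the two `log_q` terms cancel, and adding any constant to `log_post` leaves r unchanged.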


Step 4: Decide whether to accept $\theta^*$

> Accept $\theta^*$ as $\theta^{(t)}$ with probability $\min(r, 1)$.

> For each $\theta^*$, draw a value $u$ from the Uniform(0, 1) distribution.

> If $u \le r$, accept $\theta^*$ as $\theta^{(t)}$. Otherwise $\theta^{(t)} = \theta^{(t-1)}$.

> Candidate samples with higher density than the current sample are always accepted.

> Unlike in rejection sampling, each iteration always produces a sample, either $\theta^*$ or $\theta^{(t-1)}$.
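
Steps 2 to 4 can be put together in a short random-walk Metropolis sampler. This is a minimal sketch under stated assumptions: the function name `metropolis`, the unnormalised standard-normal target, the step size 2.4 and the seed are all choices made for illustration and do not come from the slides.

```python
import math
import random

def metropolis(log_post, start, n_iter, step=1.0, seed=0):
    """Random-walk Metropolis; returns (draws, acceptance_rate)."""
    rng = random.Random(seed)
    current = start
    draws, accepted = [], 0
    for _ in range(n_iter):
        candidate = current + rng.gauss(0.0, step)        # Step 2: propose
        log_r = log_post(candidate) - log_post(current)   # Step 3: symmetric proposal
        u = rng.random()                                  # Step 4: u ~ Uniform(0, 1)
        if log_r >= 0 or u <= math.exp(log_r):
            current = candidate
            accepted += 1
        draws.append(current)  # every iteration yields a draw, new or repeated
    return draws, accepted / n_iter

log_post = lambda th: -0.5 * th * th  # assumed target: N(0, 1) up to a constant
draws, rate = metropolis(log_post, start=0.0, n_iter=20000, step=2.4)
print(sum(draws) / len(draws), rate)
```

The sampler stores the current value again whenever a candidate is rejected, which is what distinguishes it from rejection sampling.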


Acceptance Rate

> It is important to monitor the acceptance rate (the fraction of candidate samples that are accepted) of your Metropolis-Hastings algorithm.

> If the acceptance rate is too high, the chain is probably not mixing well; that is, the chain is not moving around the parameter space quickly enough.

> If the acceptance rate is too low, your algorithm is too inefficient; that is, it is rejecting too many candidate samples.

> What is too high and too low depends on the specific algorithm, but generally:
- random walk: somewhere between 0.25 and 0.50 is recommended
- independent: something close to 1 is preferred
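
One rough way to see how the acceptance rate responds to the proposal scale is to run a random-walk sampler with very small, moderate and very large steps. The sketch below assumes an unnormalised standard-normal target; the function name `acceptance_rate` and the three step sizes are illustrative choices, not values from the slides.

```python
import math
import random

def acceptance_rate(step, n_iter=5000, seed=1):
    """Acceptance rate of random-walk Metropolis on an unnormalised N(0, 1) target."""
    rng = random.Random(seed)
    current, accepted = 0.0, 0
    for _ in range(n_iter):
        candidate = current + rng.gauss(0.0, step)
        log_r = 0.5 * (current ** 2 - candidate ** 2)  # log p(candidate) - log p(current)
        if log_r >= 0 or rng.random() <= math.exp(log_r):
            current, accepted = candidate, accepted + 1
    return accepted / n_iter

for step in (0.05, 2.4, 50.0):
    print(step, acceptance_rate(step))
```

Tiny steps are almost always accepted but barely move the chain, while huge steps are almost always rejected; the moderate step falls in between.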


MCMC Convergence

> After the model has converged, samples from the conditional distributions are used to summarize the posterior distribution of the parameters of interest $\theta$.

> Convergence refers to the idea that the Gibbs sampler, or whichever other MCMC technique we choose, will eventually reach a stationary distribution.

> From this point onwards it stays in this distribution and moves about, or "mixes", throughout the subspace forever.


MCMC Convergence

> The general questions for us:

1. At what point do we know that we have converged to the stationary distribution? (i.e., how long should our "burn-in" period be?)

2. After we have reached the stationary distribution, how many iterations will it take to summarize the posterior distribution?

> The answers to both of these questions remain a bit ad hoc because the desirable results that we depend on are only true asymptotically, and we do not want to wait for an infinite number of draws.


Possible problems with MCMC

> The assumed model may not be realistic from a substantive point of view or may not fit.

> Errors in calculation or programming!
- Often, simple syntax mistakes may be responsible; however, it is possible that the algorithm may not converge to a proper distribution.

> Slow convergence: this is the problem we are most likely to run into. The simulation can remain for many iterations in a region heavily influenced by the starting distribution. If these iterations are used to summarize the target distribution, they can yield falsely precise estimates.
- This will be the focus of our discussion.


Traceplots

> One intuitive and easily implemented diagnostic tool is a traceplot (or history plot), which plots the parameter value at time t against the iteration number.

> If the model has converged, the traceplot will move like a snake around the mode of the distribution.

> A clear sign of non-convergence with a traceplot occurs when we observe some trending in the sample space.

> The problem with traceplots is that it may appear that we have converged when, in fact, the chain is trapped (for a finite time) in a local region rather than exploring the full posterior.


Examples of Apparent Convergence and Non-Convergence Based on a trace plot

[Figure: trace plots of alpha and beta against iteration (10000 values each) for two runs of the BEETLES example, "Trace of beetles1" and "Trace of beetles2".]


Autocorrelation Diagnostic

> Autocorrelation refers to a pattern of serial correlation in the chain, where sequential draws of a parameter, say $\theta_1$, from the conditional distribution are correlated.

> The cause of autocorrelation is that the parameters in our model may be highly correlated, so the Gibbs sampler will be slow to explore the entire posterior distribution.

> The reason why autocorrelation is important is that, when it is present, it will take a very long time to explore the entire posterior distribution.


Autocorrelation Diagnostic

> Note that if the level of autocorrelation is high for a parameter of interest, then a traceplot will be a poor diagnostic for convergence.

> Typically, the level of autocorrelation will decline with an increasing number of lags in the chain (e.g., as we go from the 1000th to the 1010th lag, the level of autocorrelation will often decline). When this dampening doesn't occur, you have a problem and will probably want to re-parameterize the model (more on this below).
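
The decline of autocorrelation with lag can be checked numerically. The sketch below is illustrative only: the function name `autocorrelation` is hypothetical, and the chain is a simulated AR(1) series with coefficient 0.9 standing in for real, strongly autocorrelated MCMC output.

```python
import random

def autocorrelation(chain, lag):
    """Sample autocorrelation of `chain` at a non-negative integer `lag`."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain)
    cov = sum((chain[i] - mean) * (chain[i + lag] - mean) for i in range(n - lag))
    return cov / var

# Assumed stand-in for sampler output: AR(1) series with coefficient 0.9.
rng = random.Random(0)
chain = [0.0]
for _ in range(20000):
    chain.append(0.9 * chain[-1] + rng.gauss(0.0, 1.0))

for lag in (1, 5, 10, 50):
    print(lag, round(autocorrelation(chain, lag), 3))
```

For this simulated series the printed values shrink towards zero as the lag grows, which is the dampening pattern described above.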


Example of Model with autocorrelation

[Figure: autocorrelation plot for mu.b[1], chains 1:2.]

Example of Model without autocorrelation

[Figure: autocorrelation plot for betaD[1].]


Running Means of parameters

> Running means: once you have taken enough draws to summarize the posterior distribution, then, if the model has converged, further samples from a parameter's posterior distribution should not influence the calculation of the mean.

> A plot of the average of draws 1 through t from the conditional distribution against t is useful for identifying convergence.

> Note that you could probably get the same effect with a traceplot.
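
The running mean is simply the cumulative average of the chain. A minimal sketch follows; the function name `running_mean` is hypothetical, and the chain is simulated independent normal draws used as a stand-in for converged sampler output.

```python
import random

def running_mean(chain):
    """Mean of draws 1..t for every t = 1, ..., len(chain)."""
    means, total = [], 0.0
    for t, value in enumerate(chain, start=1):
        total += value
        means.append(total / t)
    return means

rng = random.Random(0)
chain = [rng.gauss(2.0, 1.0) for _ in range(5000)]  # assumed stand-in for converged draws
means = running_mean(chain)
print(means[99], means[999], means[-1])
```

Plotting `means` against t gives the running-mean plot; for a converged chain the curve flattens out as t grows.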


Kernel Density Plot

> Kernel density plots (a.k.a. smoothed density histograms) may be a useful diagnostic.

> Sometimes non-convergence is reflected in multimodal distributions. This is especially true if the kernel density plot isn't just multi-modal, but lumpy (you will know what I mean when you see it).

> When you get a lumpy posterior, it may be important to let the algorithm run a bit longer. Often, doing this will allow you to get a much more reasonable summary of the posterior.


Example of a problematic kernel density plot

[Figure: kernel density plot for mu.b[8], chains 1:2, sample: 128.]

A more satisfactory kernel density plot would look more bell-shaped, though it need not be symmetric.


Geweke Time-Series Diagnostic

> The Geweke Time-Series Diagnostic: if a model has converged, then, if we simulate a large number of draws, the mean (and variance) of a parameter's posterior distribution from the first half of the chain will be approximately equal to the mean (and variance) from the second half of the chain.

> Technically, this statistic is based on spectral density functions that are beyond the purview of this course, and WinBUGS does not estimate this statistic directly; but if you export the CODA chain to R, the packages CODA and BOA report the Geweke statistic.

> However, a perfectly reasonable way to proceed is to look to see whether the posterior means (and variances) of your parameters are approximately the same for different halves of your simulated chain.


> The value of this approach is that by allowing the algorithm 
to run for a very long time, it may reach areas of the posterior 
distribution that may not otherwise be reached. 
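
The informal half-versus-half check described above can be sketched as follows. Note that this is not the Geweke statistic itself, which relies on spectral density estimates; it is only the simple comparison of means and variances. The function name `compare_halves` and both simulated chains are assumptions made for illustration.

```python
import random
import statistics

def compare_halves(chain):
    """Mean and variance of the first and second halves of a chain."""
    half = len(chain) // 2
    first, second = chain[:half], chain[half:]
    return {
        "mean": (statistics.fmean(first), statistics.fmean(second)),
        "variance": (statistics.variance(first), statistics.variance(second)),
    }

rng = random.Random(0)
stationary = [rng.gauss(0.0, 1.0) for _ in range(4000)]           # no trend
drifting = [0.002 * t + rng.gauss(0.0, 1.0) for t in range(4000)]  # slow upward trend

print(compare_halves(stationary))
print(compare_halves(drifting))
```

For the stationary series the two halves agree closely, while the drifting series shows clearly different means, which is the kind of discrepancy that signals non-convergence.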


Gelman-Rubin Diagnostic

> Gelman (especially) argues that the best way to identify non-convergence is to simulate multiple sequences from over-dispersed starting points.

> The intuition is that the behavior of all of the chains should be basically the same.

> Or, as Gelman and Rubin put it, the variance within the chains should be the same as the variance across the chains.

> I think this can be diagnosed pretty easily through traceplots of multiple chains. You want to see if it looks like the mean and the variance of the two chains are the same.


Examples where convergence seems reasonable

[Figure: trace plot of alpha0, chains 1:2, against iteration.]

Examples where convergence seems unreasonable

[Figure: trace plot of alpha, chains 1:2, against iteration.]


Gelman-Rubin Diagnostic

> The Gelman-Rubin statistic is based on the following procedure:

1. Estimate your model with a variety of different initial values and iterate for an n-iteration burn-in and an n-iteration monitored period.

2. Take the n monitored draws from each of the m chains and calculate the following statistics:

2.1 Within-chain variance: $W = \frac{1}{m(n-1)} \sum_{j=1}^{m} \sum_{i=1}^{n} (\theta_j^{i} - \bar{\theta}_j)^2$

2.2 Between-chain variance: $B = \frac{n}{m-1} \sum_{j=1}^{m} (\bar{\theta}_j - \bar{\theta})^2$

2.3 Estimated variance: $\hat{V}(\theta) = \left(1 - \frac{1}{n}\right) W + \frac{1}{n} B$

2.4 Gelman-Rubin statistic: $\sqrt{R} = \sqrt{\hat{V}(\theta) / W}$
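
The four quantities in step 2 translate directly into code. The sketch below follows the formulas as reconstructed here, with m chains of n monitored draws each; the function name `gelman_rubin` and the simulated chains are assumptions made for illustration and are not taken from the slides.

```python
import math
import random

def gelman_rubin(chains):
    """sqrt(R) from m chains of n monitored draws each (list of equal-length lists)."""
    m, n = len(chains), len(chains[0])
    chain_means = [sum(c) / n for c in chains]
    grand_mean = sum(chain_means) / m
    # 2.1 within-chain variance
    W = sum((x - mu) ** 2 for c, mu in zip(chains, chain_means) for x in c) / (m * (n - 1))
    # 2.2 between-chain variance
    B = n / (m - 1) * sum((mu - grand_mean) ** 2 for mu in chain_means)
    # 2.3 estimated variance
    V = (1 - 1 / n) * W + B / n
    # 2.4 Gelman-Rubin statistic
    return math.sqrt(V / W)

rng = random.Random(0)
agree = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(3)]
disagree = [[rng.gauss(shift, 1.0) for _ in range(2000)] for shift in (-3.0, 0.0, 3.0)]
print(gelman_rubin(agree), gelman_rubin(disagree))
```

Chains that sample the same distribution give a value very close to one, whereas chains stuck around different locations give a value well above one.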


Gelman-Rubin Diagnostic

> Before convergence, W underestimates the total posterior variance in $\theta$ because it has not fully explored the target distribution.

> $\hat{V}(\theta)$, on the other hand, overestimates the variance in $\theta$ because the starting points are over-dispersed relative to the target.

> Once convergence is reached, W and $\hat{V}(\theta)$ should be almost equivalent because variation within the chains and variation between the chains should coincide, so R should approximately equal one.

> The drawback of this statistic is that its value depends a great deal on the choice of initial values.


Convergence Diagnostic Summary

> You can never prove that something has converged; you can only tell when something has not converged.

> If your model has not converged and you are confident that YOU have not made any mistake, then the best thing is to just let the model run a long time. CPU time is often "cheaper" than your time.

> For models with large numbers of parameters you should let the model run for a long time.

> There are a number of easy-to-implement tricks (mostly reparameterizations) that will help to speed convergence in most regression-based models.

> Convergence does not mean that you have a good model!!! Convergence should be the beginning of model assessment, not its end.




A Gateway to all Postgraduate Courses An MHRD project under its National Mission on Education through ICT (NME-ICT) 


Subject: Statistics 


Paper: Statistical Inference 
Module: Bayesian Hypothesis Testing and 
Bayes Factors 


Development Team


Principal investigator: Dr. Bhaswati Ganguli, Professor, 
Department of Statistics, University of Calcutta 

Paper co-ordinator: Dr. Dipak K Dey, Associate Dean and BOT 
Distinguished Professor, Department of Statistics, 
University of Connecticut 

Content writer: Dr. Sourish Das, Assistant Professor, Chennai 
Mathematical Institute 


Content reviewer: Department of Statistics, University of Calcutta 


Outline


1. Bayesian p-values 


2. Bayes Factors for model comparison 


3. Easy to implement alternatives for model comparison 


Bayesian Hypothesis Testing

> Bayesian hypothesis testing is less formal than the frequentist approach.

> In fact, Bayesian researchers typically summarize the posterior distribution without applying a rigid decision process.

> If one wanted to apply a formal process, Bayesian decision theory is the way to go because it is possible to get a probability distribution over the parameter space and one can make expected utility calculations based on the costs and benefits of different outcomes.

> Considerable energy has been given, however, to trying to map Bayesian statistical models into the null hypothesis testing framework, with mixed results at best.


Similarities between Bayesian and Frequentist Hypothesis Testing

> Maximum likelihood estimates of parameter means and standard errors and Bayesian estimates with flat priors are equivalent.

> Asymptotically, the data will overwhelm the choice of prior, so if we had infinite data sets, priors would be irrelevant and Bayesian and frequentist results would converge.

> Frequentist one-tailed tests are basically equivalent to what a Bayesian would get using credible intervals.


Differences between Frequentist and a6 athzhnia 
Bayesian Hypothesis Testing 3e LK 


WTOSITeIT 


» The most important pragmatic difference between Bayesian 
and frequentist hypothesis testing is that Bayesian methods 
are poorly suited for two-tailed tests. 


» Why? Because the probability of zero in a continuous 
distribution is zero. 
The best solution proposed so far is to calculate the probability 
that, say, a regression coefficient is in some range near zero. 
e.g., two-sided p-value- P(—e < B < e) 


» However, the choice of e seems very ad hoc unless there is 
some decision theoretic basis. 


MHRD 


ovt. of India 


Differences between Frequentist and
Bayesian Hypothesis Testing


» The most important pragmatic difference between Bayesian
and frequentist hypothesis testing is that Bayesian methods
are poorly suited for two-tailed tests.


» Why? Because under a continuous posterior distribution the
probability that a parameter equals exactly zero is zero.
The best solution proposed so far is to calculate the probability
that, say, a regression coefficient is in some range near zero,
e.g., as an analogue of the two-sided p-value, $P(-\epsilon < \beta < \epsilon)$.


» However, the choice of $\epsilon$ seems very ad hoc unless there is
some decision-theoretic basis.


» The other important difference is more philosophical.
Frequentist p-values violate the likelihood principle.
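
The "range near zero" idea can be sketched in a few lines of R. This is only an illustration: the posterior draws below are simulated from a normal distribution as a hypothetical stand-in for real MCMC output, and the values of eps are arbitrary choices, which is exactly the ad hoc element noted above.

```r
# Hypothetical stand-in for MCMC output: 10,000 posterior draws of a
# regression coefficient beta (assumed roughly normal, mean 0.3, sd 0.2).
set.seed(1)
beta_draws <- rnorm(10000, mean = 0.3, sd = 0.2)

# Posterior probability that beta lies within eps of zero,
# estimated as the share of draws falling in (-eps, eps).
prob_near_zero <- function(draws, eps) {
  mean(draws > -eps & draws < eps)
}

eps_grid <- c(0.05, 0.10, 0.20)   # arbitrary choices of eps
probs <- sapply(eps_grid, function(e) prob_near_zero(beta_draws, e))
print(data.frame(eps = eps_grid, prob = probs))

# For this simulated stand-in the exact answer is known, for comparison.
exact <- pnorm(eps_grid, 0.3, 0.2) - pnorm(-eps_grid, 0.3, 0.2)
```

The answer changes noticeably with eps, so any conclusion drawn this way should report the eps that was used.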


Bayes Factors


» Bayes Factors are the dominant method of Bayesian model
testing. They are the Bayesian analogues of likelihood ratio
tests.


» The basic intuition is that prior and posterior information are
combined in a ratio that provides evidence in favor of one
model specification versus another.


» Bayes Factors are very flexible: multiple hypotheses can be
compared simultaneously, and the models need not be nested.
The compared models must, of course, have the same
dependent variable.


The General Form for Bayes Factors


» Suppose that we observe data $x$ and wish to test two
competing models, $M_1$ and $M_2$, relating these data to two
different sets of parameters $\theta_1$ and $\theta_2$.


» We would like to know which of the following likelihood
specifications is better:


$$M_1 : f_1(x \mid \theta_1) \quad \text{and} \quad M_2 : f_2(x \mid \theta_2)$$


» Obviously, we would need prior distributions for $\theta_1$ and $\theta_2$
and prior probabilities for $M_1$ and $M_2$.


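The General Form for Bayes Factors


» The posterior odds ratio in favor of $M_1$ over $M_2$ is:
Posterior Odds = Prior Odds × Bayes Factor (the marginal density of the data, $p(x)$, cancels)


$$\frac{\pi(M_1 \mid x)}{\pi(M_2 \mid x)} = \frac{p(M_1)/p(x)}{p(M_2)/p(x)} \times \frac{\int_{\Theta_1} f_1(x \mid \theta_1)\, p_1(\theta_1)\, d\theta_1}{\int_{\Theta_2} f_2(x \mid \theta_2)\, p_2(\theta_2)\, d\theta_2}$$


» Rearranging terms, we find the Bayes factor is:


$$\text{Bayes Factor} = B(x) = \frac{\pi(M_1 \mid x)/p(M_1)}{\pi(M_2 \mid x)/p(M_2)}$$


» If we have nested models and $p(M_1) = p(M_2) = 0.5$, then
the posterior odds equal the Bayes factor, which reduces to a
ratio of (marginal) likelihoods.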
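
As a rough numerical illustration of the ratio of integrals above, the following R sketch computes a Bayes factor with `integrate()`. Everything specific here is an assumption made for the example: the data (7 successes in 20 trials), the binomial likelihood, and the two Beta priors. For simplicity both models share the same likelihood and differ only in their priors, which is a special case of the general setup.

```r
# Hypothetical data: x successes in n Bernoulli trials.
x <- 7
n <- 20

# Binomial likelihood f(x | theta), shared by both models in this sketch.
lik <- function(theta) dbinom(x, size = n, prob = theta)

# Marginal likelihood: integral of f(x | theta) * p(theta) over theta,
# with a Beta(a, b) prior on theta.
marg_lik <- function(a, b) {
  integrate(function(th) lik(th) * dbeta(th, a, b),
            lower = 0, upper = 1)$value
}

m1 <- marg_lik(1, 1)    # M1: theta ~ Beta(1, 1), a flat prior
m2 <- marg_lik(20, 5)   # M2: theta ~ Beta(20, 5), prior centred near 0.8

B12 <- m1 / m2          # Bayes factor in favour of M1 over M2

# Posterior odds = prior odds x Bayes factor.
prior_odds     <- 0.5 / 0.5
posterior_odds <- prior_odds * B12
print(c(B12 = B12, posterior_odds = posterior_odds))
```

With equal prior model probabilities the posterior odds and the Bayes factor coincide, as the last bullet states. In realistic models the integrals have no convenient one-dimensional form, which is why simulation-based methods are needed.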


Rule of Thumb


» Bayes Factor:


$$B(x) = \frac{\pi(M_1 \mid x)/p(M_1)}{\pi(M_2 \mid x)/p(M_2)}$$


» With this setup, if we interpret model 1 as the null model,
then:


1. If $B(x) > 1$ then model 1 is supported.
2. If $1 > B(x) > 10^{-1/2}$ then minimal evidence against model 1.
3. If $10^{-1/2} > B(x) > 10^{-1}$ then substantial evidence against
model 1.
4. If $10^{-1} > B(x) > 10^{-2}$ then strong evidence against model 1.
5. If $10^{-2} > B(x)$ then decisive evidence against model 1.
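
The cut-points listed above can be wrapped in a small helper. The function name `classify_bf` is hypothetical, and the handling of values that fall exactly on a boundary (assigned here to the upper category) is an assumption, since the list uses strict inequalities throughout.

```r
# Classify a Bayes factor B(x) for model 1 (the null) against model 2
# using the rule-of-thumb cut-points 1, 10^(-1/2), 10^(-1), 10^(-2).
classify_bf <- function(B) {
  stopifnot(is.numeric(B), all(B > 0))
  labels <- c("decisive evidence against model 1",
              "strong evidence against model 1",
              "substantial evidence against model 1",
              "minimal evidence against model 1",
              "model 1 is supported")
  breaks <- c(0, 10^-2, 10^-1, 10^-0.5, 1, Inf)
  # right = FALSE gives intervals closed on the left: [0.01, 0.1), etc.
  as.character(cut(B, breaks = breaks, labels = labels, right = FALSE))
}

# Arbitrary illustrative values, one per category.
bf_values <- c(2, 0.5, 0.2, 0.05, 0.001)
bf_labels <- classify_bf(bf_values)
print(data.frame(B = bf_values, verdict = bf_labels))
```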


The Bad News


» Unfortunately, while Bayes Factors are rather intuitive, as a
practical matter they are often quite difficult to calculate.


» However, with the MCMCpack package in R, Bayes Factors can be
computed routinely for standard statistical models.


» You may also want to use Carlin and Chib's technique for
computing Bayes Factors for competing non-nested regression
models, reported in the Journal of the Royal Statistical Society,
Series B, vol. 57(3), 1995.


» Our discussion will focus on alternatives to the Bayes Factor.


Alternatives to the Bayes Factor
for model assessment


» Let $\theta^*$ denote your estimates of the parameter means (or
medians or modes) in your model and suppose that the Bayes
estimate is approximately equal to the maximum likelihood
estimate. Then the following statistics used in frequentist
statistics will be useful diagnostics.


» Good: The Likelihood Ratio


$$\text{Ratio} = -2\left[\log L(\theta^*_{\text{Restricted Model}} \mid y) - \log L(\theta^*_{\text{Full Model}} \mid y)\right]$$


» This statistic will always favor the unrestricted model, but
when the Bayes estimators are equivalent to the maximum
likelihood estimates, the Ratio is (asymptotically, under the
restricted model) distributed as a $\chi^2$ whose number of degrees of
freedom is equal to the number of test parameters.


» Better: Akaike Information Criterion (AIC)

$$\text{AIC} = -2 \log L(\theta^* \mid y) + 2p$$

where $p$ is the number of parameters including the intercept.


» To compare two models, compare the AIC from model 1
against the AIC from model 2; the smaller AIC is preferred.


» Models do not need to be nested.


» The AIC tends to be biased in favor of more complicated
models, because the log-likelihood tends to increase faster
than the number of parameters.


» Bayesian Information Criterion (BIC):

$$\text{BIC} = -2 \log L(\theta^* \mid y) + p \log(n)$$

where $p$ is the number of parameters and $n$ is the sample size.


» This statistic can also be used for non-nested models.


» $\text{BIC}_1 - \text{BIC}_2 \approx -2 \log(\text{Bayes Factor}_{12})$ for Model 1 vs
Model 2.
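
The likelihood ratio, AIC and BIC formulas can be checked by hand against a fitted model. The sketch below uses simulated data and maximum likelihood probit fits from `glm()` as a stand-in for Bayes estimates that are close to the MLE. The data-generating values and variable names are assumptions made for the example.

```r
# Simulated data: y depends on x1 only; x2 is an irrelevant covariate.
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, size = 1, prob = pnorm(-0.5 + 0.8 * x1))

restricted <- glm(y ~ x1,      family = binomial(link = "probit"))
full       <- glm(y ~ x1 + x2, family = binomial(link = "probit"))

ll_r <- as.numeric(logLik(restricted)); p_r <- length(coef(restricted))
ll_f <- as.numeric(logLik(full));       p_f <- length(coef(full))

# Likelihood ratio statistic and its chi-square p-value
# (degrees of freedom = number of restricted parameters, here 1).
lr   <- -2 * (ll_r - ll_f)
pval <- pchisq(lr, df = p_f - p_r, lower.tail = FALSE)

# AIC and BIC computed directly from the formulas.
aic <- c(restricted = -2 * ll_r + 2 * p_r,
         full       = -2 * ll_f + 2 * p_f)
bic <- c(restricted = -2 * ll_r + p_r * log(n),
         full       = -2 * ll_f + p_f * log(n))

# Rough Bayes factor (restricted vs full) implied by the BIC difference.
approx_bf <- exp(-(bic["restricted"] - bic["full"]) / 2)

print(list(lr = lr, pval = pval, aic = aic, bic = bic, approx_bf = approx_bf))
```

The hand-computed values agree with R's built-in `AIC()` and `BIC()`. The BIC-based Bayes factor is only an approximation and ignores the actual priors.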


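Application


» Consider the 'birthwt' dataset available in the MASS package of R.


» The dataset is used to look for the risk factors associated with
low infant birth weight.


$$\text{low}_i = \begin{cases} 1 & \text{if birth weight is less than 2.5 kg} \\ 0 & \text{otherwise} \end{cases} \qquad i = 1, 2, \ldots, 189$$


$$z_i = \beta_0 + \beta_1 \text{Age}_i + \beta_2 I(\text{race}_i = \text{black}) + \beta_3 I(\text{race}_i = \text{others}) + \beta_4 I(\text{Smoke}_i = \text{yes}) + e_i$$


$$e_i \sim N(0, 1) \quad \text{and} \quad P(\text{low} = 1 \mid \text{Age}, \text{race}, \text{smoke}) = P(z > 0 \mid \text{Age}, \text{race}, \text{smoke})$$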
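
The latent-variable statement $P(\text{low} = 1) = P(z > 0)$ is the usual probit probability $\Phi(x'\beta)$. A minimal simulation sketch of this identity follows. The coefficient values and the covariate profile are hypothetical and are not estimates from the birthwt data.

```r
set.seed(7)

# Hypothetical coefficients: intercept, age, smoke.
beta <- c(-1.0, 0.02, 0.4)
x    <- c(1, 25, 1)            # a 25-year-old smoker (race terms omitted)
eta  <- sum(x * beta)          # linear predictor x'beta

# Simulate the latent variable z = x'beta + e with e ~ N(0, 1).
z <- eta + rnorm(1e5)

p_sim   <- mean(z > 0)         # simulated P(z > 0)
p_exact <- pnorm(eta)          # probit probability Phi(x'beta)
print(c(simulated = p_sim, exact = p_exact))
```

The two numbers agree up to simulation error, which is why fitting a probit model is equivalent to fitting the latent regression with unit error variance.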


Posterior Summary


> library(MCMCpack)
> set.seed(8135)
> data(birthwt)
> # b0 = prior mean, B0 = prior precision of the coefficients;
> # "Chib95" requests the marginal likelihood needed for Bayes factors.
> # M1: race, age and smoking status
> M1 <- MCMCprobit(low ~ as.factor(race) + age + smoke,
+                  data = birthwt, b0 = 0, B0 = 10,
+                  marginal.likelihood = "Chib95")
> # M2: drops age
> M2 <- MCMCprobit(low ~ as.factor(race) + smoke,
+                  data = birthwt, b0 = 0, B0 = 10,
+                  marginal.likelihood = "Chib95")
> # M3: drops smoking status
> M3 <- MCMCprobit(low ~ as.factor(race) + age,
+                  data = birthwt, b0 = 0, B0 = 10,
+                  marginal.likelihood = "Chib95")


> BF <- BayesFactor(M1, M2, M3)
> round(BF$BF.mat, digits = 3)

      M1    M2    M3
M1 1.000 1.445 6.807
M2 0.692 1.000 4.711
M3 0.147 0.212 1.000


» $BF_{12} = 1.445$ indicates that the data are 1.45 times more
likely under Model 1 (M1) than under Model 2 (M2). This can be
considered anecdotal evidence.

» $BF_{13} = 6.807$ indicates that the data are 6.81 times more
likely under Model 1 (M1) than under Model 3 (M3). This can be
considered moderate evidence.

» $BF_{23} = 4.711$ indicates that the data are 4.71 times more
likely under Model 2 (M2) than under Model 3 (M3).
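
One way to read the Bayes factor matrix is to convert it into posterior model probabilities. The sketch below re-enters the reported matrix by hand and assumes equal prior probabilities of 1/3 for the three models. That assumption is made for illustration and is not stated in the analysis above.

```r
# Bayes factor matrix as reported: entry [i, j] is the Bayes factor
# for model i against model j.
BF <- matrix(c(1.000, 1.445, 6.807,
               0.692, 1.000, 4.711,
               0.147, 0.212, 1.000),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("M1", "M2", "M3"), c("M1", "M2", "M3")))

# With equal prior model probabilities, P(Mk | x) is proportional to the
# Bayes factor of Mk against any fixed reference model (here M1).
post_prob <- BF[, "M1"] / sum(BF[, "M1"])
print(round(post_prob, 3))

# Internal consistency check: BF12 * BF23 should reproduce BF13.
chain <- BF["M1", "M2"] * BF["M2", "M3"]
print(c(BF12_times_BF23 = chain, BF13 = BF["M1", "M3"]))
```

Under this assumption M1 receives a little over half of the posterior probability, which matches the reading that the evidence for M1 over M2 is weak while M3 is clearly the least supported.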


A Gateway to all Postgraduate Courses An MHRD project under its National Mission on Education through ICT (NME-ICT) 


Subject: Statistics 


Paper: Statistical Inference 
Module: Advanced Hierarchical Models - Part 1


Development Team


Principal investigator: Dr. Bhaswati Ganguli, Professor, 
Department of Statistics, University of Calcutta 

Paper co-ordinator: Dr. Dipak K Dey, Associate Dean and BOT 
Distinguished Professor, Department of Statistics, 
University of Connecticut 

Content writer: Dr. Sourish Das, Assistant Professor, Chennai 
Mathematical Institute 


Content reviewer: Department of Statistics, University of Calcutta 


Outline


1. The intuition behind hierarchical regression models 


2. Setting up probability models for hierarchical regressions 


Overview


» Hierarchical data is ubiquitous in the social sciences where
measurement occurs at different levels of aggregation.
  » e.g. we collect measurements by geographic region or social
  group.


» Hierarchical models provide a way of examining differences
across populations. They pool the information for the
disparate groups without assuming that they belong to
precisely the same population.


» In the context of regression analyses, hierarchical models allow
us to examine the extent to which regression
coefficients vary across different sub-populations, while
borrowing strength from the full sample.


Example


» The importance of uncertainty about the Democratic Party's
ideology for its electoral success during the Jacksonian era
(circa 1840).


» Dependent variable:
  » Percentage of seats won by the Democratic Party in the
  United States House of Representatives in state i in election t.


» Independent variable:
  » Level of ideological conflict within state i's Democratic Party
  delegation to the House in period t - 1.


» Possible control variables include dummy variables for the
various states, measuring their preference for the Democratic
Party, and for each election.


» Key modeling question: Does the sample pool?


Parameters of Pooled OLS Model of
Democratic Electoral Success Due to
Intra-Party Unity


Dep. Var.: Democratic     Posterior   Posterior standard
Electoral Success         mean        deviation
Intercept                  0.578*     0.024
Ideological Conflict      -3.512*     1.192


» Mean Squared Error: 0.08516


* Denotes statistical significance


Unpooled OLS Model
(Different state-specific intercepts and slopes)


[Figure: predicted percentage of seats (vertical axis) plotted against intra-party conflict (horizontal axis, 0 to 0.1).]


> F-tests reject the unpooled model as statistically unwarranted; 
however, there were significant state-specific intercepts and 
coefficients suggesting that there was causal heterogeneity in 
the model. What to do? 


» In a context like this, hierarchical structures are perfect.
  » Where differences are not statistically important, the
  state-specific coefficients are shrunk back toward the national
  average.
  » Where differences are statistically meaningful, the state-specific
  effects remain markedly different from the national average.


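The Hierarchical Probability Model


» $\text{Electoral Success}_{it} \sim N(\mu_{it}, \tau)$
» where $\mu_{it} = a_i + b_i \times \text{Intra-Party Conflict}_{i,t-1}$,
» $a_i \sim N(A, \tau_A)$ for all $i$
» $A \sim N(0, 0.01)$ and $\tau_A \sim \text{Gamma}(0.1, 0.1)$
» $b_i \sim N(B, \tau_B)$ for all $i$
» $B \sim N(0, 0.01)$ and $\tau_B \sim \text{Gamma}(0.1, 0.1)$
» and $\tau \sim \text{Gamma}(0.1, 0.1)$

Here the second argument of $N(\cdot, \cdot)$ is a precision (inverse variance), not a variance.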
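
To make the layered structure concrete, the following R sketch simulates one data set from the model. It is a simplification: the hyper-parameters are fixed at hypothetical values rather than drawn from their priors, and the numbers of states and elections and the range of the conflict variable are assumptions. Because every $\tau$ is a precision, the standard deviation passed to `rnorm()` is $1/\sqrt{\tau}$.

```r
set.seed(123)
n_states    <- 10
n_elections <- 6

# Hypothetical hyper-parameter values (not drawn from the priors).
A <- 0.55;  tau_A <- 25    # mean and precision of state intercepts
B <- -3.0;  tau_B <- 4     # mean and precision of state slopes
tau <- 100                 # observation-level precision

# Level 2: state-specific intercepts and slopes.
a <- rnorm(n_states, mean = A, sd = 1 / sqrt(tau_A))
b <- rnorm(n_states, mean = B, sd = 1 / sqrt(tau_B))

# Level 1: one row per state and election.
sim <- expand.grid(election = seq_len(n_elections),
                   state    = seq_len(n_states))
sim$conflict <- runif(nrow(sim), 0, 0.1)     # lagged intra-party conflict
sim$mu       <- a[sim$state] + b[sim$state] * sim$conflict
sim$success  <- rnorm(nrow(sim), mean = sim$mu, sd = 1 / sqrt(tau))

print(head(sim))
```

Fitting the model reverses this process: given the observed success and conflict values, the sampler recovers the posterior of the state-level parameters and of the hyper-parameters.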


Comments


» The crucial difference between unpooled OLS and the
hierarchical model is that the state-specific intercept terms
and the coefficients for intra-party conflict are now treated as
exchangeable draws from a common probability model with
unknown mean and variance.


» The posterior distributions of these state-specific parameters
convey information about local effects.


» The hyper-parameter $A$ represents the average level of
Democratic electoral success, while $\tau_A$ measures (as a precision)
the variation in the party's fortunes across states.


» Similarly, $B$ is the average impact of intra-party conflict, while
$\tau_B$ indicates the variation in the influence of party unity across
states.


» If the posterior distributions of the hyper-parameters reveal
that $\tau_A = \tau_B = \infty$, then pooled OLS is a special case.


» This is because if there is no variance (i.e. infinite precision)
in the intercept or coefficient across states, then one should
conclude that there are no regime effects.


» Similarly, if $\tau_A = \tau_B = 0$, then unpooled OLS is a special case
because there is no underlying structure to the data across
states.
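
The two limiting cases can be seen in the simplest version of the model, with a state-specific intercept only and all precisions treated as known. In that simplified setting (an assumption made for this sketch, not the full model above) the posterior mean of $a_i$ is a precision-weighted average of the state's own sample mean and the overall mean $A$. The function name and all numerical values are hypothetical.

```r
# Posterior mean of a state intercept in a normal-normal model with
# known precisions: n observations with precision tau and sample mean
# ybar, and prior a_i ~ N(A, precision tau_A).
shrunk_mean <- function(ybar, n, tau, A, tau_A) {
  w <- (n * tau) / (n * tau + tau_A)   # weight on the state's own data
  w * ybar + (1 - w) * A
}

ybar <- 0.80   # hypothetical state sample mean
n    <- 5      # observations for that state
tau  <- 100    # observation precision
A    <- 0.55   # overall mean

# tau_A near 0: unpooled; moderate: partial pooling; very large: pooled.
tau_A_grid <- c(1e-6, 25, 1e6)
est <- sapply(tau_A_grid, function(tA) shrunk_mean(ybar, n, tau, A, tA))
print(data.frame(tau_A = tau_A_grid, estimate = est))
```

As $\tau_A$ grows the estimate moves from the state's own mean toward the overall mean, which is the sense in which pooled and unpooled OLS are the two extremes of the hierarchical model.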


Hyper-Parameters for Model of
Democratic Electoral Success
Due to Intra-Party Unity


State-specific intercept

Parameter                                  Posterior mean   Posterior standard deviation
Mean of state-specific intercept           0.54*            0.054
Precision of state-specific intercept      21.2             8.101


State-specific intra-party coefficients

Parameter                                  Posterior mean   Posterior standard deviation
Mean of state-specific coefficients        -2.85*           1.216
Precision of state-specific coefficients   3.071            4.81


* denotes statistical significance
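Because the table reports precisions rather than variances, it can help to convert them to the standard-deviation scale. The sketch below is only a rough plug-in conversion of the posterior mean precisions (it is not the posterior mean of the standard deviation itself); the helper name prec_to_sd is introduced here for illustration.

```r
# Posterior mean precisions taken from the table of hyper-parameters
prec_intercept <- 21.2
prec_slope <- 3.071

# A precision tau corresponds to a standard deviation of 1 / sqrt(tau)
# (prec_to_sd is a hypothetical helper, not part of any package)
prec_to_sd <- function(tau) 1 / sqrt(tau)

sd_intercept <- prec_to_sd(prec_intercept)
sd_slope <- prec_to_sd(prec_slope)

# Approximate across-state standard deviations of intercepts and slopes
print(round(c(intercept = sd_intercept, slope = sd_slope), 3))
```

A larger precision means less variation across states, so the intercepts are spread with a standard deviation of roughly 0.2 around their mean, while the slopes are spread more widely on their own scale.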


State-Specific Predicted Values


[Figure: state-specific predicted values plotted against intra-party ideological conflict (horizontal axis, 0 to 0.1).]


What sort of voodoo is this?


> The random coefficient model had such a substantial impact on
the parameter estimates for intra-party conflict precisely because
our pooling tests rejected the joint significance of state-specific
effects.


> The wild variations observed from unpooled OLS were an
artifact of over-fitting the data based on a small number of
observations.


> To prevent this over-fitting, the random coefficient model
"borrowed strength" from the overall effect of the independent
variable in order to make inferences about the state-specific
effects.


What sort of voodoo is this? (continued)


> The extent of this borrowing depends on the relative
precision of the state-specific and overall effects.


> Thus, the regression lines became approximately parallel with
the introduction of the random coefficient model because
there was relatively little information provided by the
state-specific data regarding the effect of intra-party conflict
relative to that provided by the entire sample.


> Meanwhile, the intercepts continued to vary across regression
lines, because there was sufficient state-specific data to
establish that each state had a different predisposition in favor
of or against the Democratic Party.
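The borrowing of strength can be illustrated with a precision-weighted average. This is a minimal sketch in R, not the model fitted in this module: it assumes the hyper-parameters are known, and the unpooled slopes and standard errors are hypothetical numbers chosen for illustration. Only the overall mean (-2.85) and the across-state precision (3.071) are taken from the table of hyper-parameters.

```r
# Precision-weighted compromise between a state's own estimate and the
# overall effect (hyper-parameters treated as known; an assumption)
shrink <- function(b_hat, se, gamma, tau) {
  p <- 1 / se^2          # precision of the state-specific estimate
  w <- p / (p + tau)     # weight placed on the state's own data
  w * b_hat + (1 - w) * gamma
}

gamma <- -2.85           # overall mean of the slopes (from the table)
tau <- 3.071             # across-state precision of the slopes (from the table)

b_hat <- c(-9.0, 4.0, -2.0)   # hypothetical unpooled OLS slopes
se <- c(5.0, 5.0, 0.2)        # hypothetical standard errors

shrunk <- shrink(b_hat, se, gamma, tau)
print(round(cbind(b_hat, se, shrunk), 3))

# Limiting cases discussed in the Comments slide
pooled <- shrink(b_hat, se, gamma, tau = Inf)   # every slope equals gamma
unpooled <- shrink(b_hat, se, gamma, tau = 0)   # every slope equals b_hat
```

Imprecise state-specific slopes (large standard errors) are pulled almost all the way to the overall effect, which is why the fitted lines become nearly parallel, while a precisely estimated slope stays close to its own value.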


Example:


> Consider the data set called cheese from the bayesm package.


> The data set contains marketing data for a certain brand-name
processed cheese, such as the weekly sales volume
(VOLUME), unit retail price (PRICE), and display activity
level (DISP) in various regional retailer accounts.


Example:


> For each account, we can define the following linear regression
model of the log sales volume, where $\beta_1$ is the intercept term,
$\beta_2$ is the display measure coefficient, and $\beta_3$ is the log price
coefficient.


$$\log(\text{Volume}) = \beta_1 + \beta_2 \cdot \text{Display} + \beta_3 \cdot \log(\text{Price}) + \epsilon$$


The error term $\epsilon$ depends on regional market conditions, and we
would not expect it to have the same dispersion among retailers.


> For the same reason, we cannot expect identical regression
coefficients for all accounts, or attempt to define a single
linear regression model for the entire data set.


Example:


> However, we do expect the regression coefficients of the retailer
accounts to be related. A common approach to modeling this
relationship is the hierarchical linear model, which treats the
regression coefficients as random variables of yet another linear
regression at the system level.


> Problem:


Fit the data set cheese with the hierarchical linear model, and
estimate the average impact on the sales volumes of the retailers
if the unit retail price is raised by 5%.


Example:


> Let i be an integer between 1 and the number of retailer
accounts. We will define a filter for the i-th account; first we
load the data and collect the retailer accounts.

    > library(bayesm)
    > data(cheese)
    > retailer <- levels(cheese$RETAILER)
    > nreg <- length(retailer)


Example:


> We now loop through the accounts, and create a list of data
items consisting of the X and y components of the linear
regression model in each account. The columns of X below
contain the intercept placeholder, the display measure, and
the log price data.

    > regdata <- NULL
    > for (i in 1:nreg) {
    +     filter <- cheese$RETAILER == retailer[i]
    +     y <- log(cheese$VOLUME[filter])
    +     X <- cbind(1,      # intercept placeholder
    +                cheese$DISP[filter],
    +                log(cheese$PRICE[filter]))
    +     regdata[[i]] <- list(y=y, X=X)
    + }


Example:


> We wrap the regdata and the iteration parameters in lists,
and invoke the rhierLinearModel function of the bayesm
package. It takes about half a minute for 2,000 MCMC
iterations on an average CPU.

    > Data <- list(regdata=regdata)
    > Mcmc <- list(R=2000)
    > set.seed(7831)
    > system.time(out <- bayesm::rhierLinearModel(
    +     Data=Data,
    +     Mcmc=Mcmc))
    Z not specified -- putting in iota

    Starting Gibbs Sampler for Linear Hierarchical Model
       88  Regressions
       1  Variables in Z (if 1, then only intercept)


Example:


> We can perform the same MCMC simulation with an
identically named method in rpudplus. Note the extra output
option that we explicitly set to be bayesm for compatibility.


> We estimate the log price coefficient by averaging the draws of
the third coefficient over all accounts, discarding the first 200 of
the 2,000 iterations as burn-in.

    > beta.3 <- mean(as.vector(out$betadraw[, 3, 201:2000]))
    > beta.3
    [1] -2.145307


> A 5% increase of the unit price amounts to an increment of
log(1.05) in the log price.


> The computation below shows that the sales volume is
expected to decrease by about 10% on average.

    > exp(beta.3 * log(1.05))
    [1] 0.9006218
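The same arithmetic can be wrapped in a small helper to explore other price changes. This is a sketch only: the function name pct_change_volume is hypothetical, and the coefficient is simply the posterior mean printed above, typed in by hand so that the block runs without the bayesm output.

```r
# Posterior mean of the log price coefficient, copied from the output above
beta3 <- -2.145307

# Percentage change in expected sales volume for a relative price change,
# holding the other terms of the log-log model fixed
# (pct_change_volume is a hypothetical helper)
pct_change_volume <- function(elasticity, price_change) {
  (exp(elasticity * log(1 + price_change)) - 1) * 100
}

change_5 <- pct_change_volume(beta3, 0.05)    # 5% price increase
change_10 <- pct_change_volume(beta3, 0.10)   # 10% price increase
print(round(c(five_percent = change_5, ten_percent = change_10), 2))
```

Because the model is linear in log price, the coefficient acts as a price elasticity: the multiplicative effect on volume is (1 + price change) raised to the power of the coefficient.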




A Gateway to all Postgraduate Courses An MHRD project under its National Mission on Education through ICT (NME-ICT) 


Subject: Statistics 


Paper: Statistical Inference 
Module: Advanced Hierarchical Models - Part 2


Development Team


Principal investigator: Dr. Bhaswati Ganguli, Professor, 
Department of Statistics, University of Calcutta 

Paper co-ordinator: Dr. Dipak K Dey, Associate Dean and BOT 
Distinguished Professor, Department of Statistics, 
University of Connecticut 

Content writer: Dr. Sourish Das, Assistant Professor, Chennai 
Mathematical Institute 


Content reviewer: Department of Statistics, University of Calcutta 


Outline


1. Review of the basic hierarchical regression probability model


2. Implementation of Bayesian hierarchical models


Hierarchical Regression Models


> Suppose we have a standard multiple regression model where
observations i cluster across sub-populations j, where j indexes,
for example, geographic location, social group, or period in history.


> But we do not want to assume that regression coefficients are
identical across sub-populations.


> We also want to allow for unequal variances across
sub-populations.


> We assume that each observation is normally distributed, with
an expected value determined by both observation-specific
and sub-population characteristics, and with a variance specific
to the level of aggregation. Thus,


$$y_{ij} \sim N(\mu_{ij}, \tau_j)$$


The Random Coefficient Model


> Suppose
$$y_{ij} \sim N(\mu_{ij}, \tau_j)$$
then
$$\mu_{ij} = \beta_{0j} + \beta_{1j} X_{1ij} + \dots + \beta_{mj} X_{mij}$$


> For a random coefficient (hierarchical) model, we assume
that
$$\beta_{kj} = \gamma_k + \delta_{kj},$$
where $\gamma_k$ represents the overall effect of the $k$-th covariate and
$\delta_{kj}$ represents the difference in the coefficient between
sub-population $j$ and the overall coefficient, with $E[\delta_{kj}] = 0$.


Random Coefficient Model
and its Prior


> Suppose that $y_{ij} \sim N(\mu_{ij}, \tau_j)$ with
$$\mu_{ij} = \beta_{0j} + \beta_{1j} X_{1ij} + \dots + \beta_{mj} X_{mij}$$
then
$$\mu_{ij} = (\gamma_0 + \delta_{0j}) + (\gamma_1 + \delta_{1j}) X_{1ij} + \dots + (\gamma_m + \delta_{mj}) X_{mij}$$


> We shall assume that $\tau_j \sim \text{Gamma}(0.001, 0.001)$ for all $j$.


> Two basic strategies for defining priors for the coefficients:
1. Specify priors for both $\gamma_k$ and $\delta_{kj}$ as follows:
$\gamma_k \sim N(\text{prior mean}, \text{prior prec})$ and $\delta_{kj} \sim N(0, \tau_k)$, where
$\tau_k \sim \text{Gamma}(0.001, 0.001)$.


2. Use "hierarchical centering" as follows: $\beta_{kj} \sim N(\gamma_k, \tau_k)$, where
$\gamma_k \sim N(\text{prior mean}, \text{prior prec})$ and
$\tau_k \sim \text{Gamma}(0.001, 0.001)$.


> Method 2 improves MCMC markedly in some cases (see Gilks
and Roberts, "Strategies for improving MCMC", in MCMC in
Practice).
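The two prior strategies describe the same distribution for the sub-population coefficients; they differ only in which quantities the sampler updates. The sketch below illustrates this in base R under stated assumptions: the values of gamma_k and tau_k are fixed, hypothetical numbers rather than draws from their priors, and the normal distribution is written with a precision as in the slides, so the standard deviation passed to rnorm is 1 / sqrt(tau_k).

```r
J <- 6                     # hypothetical number of sub-populations
gamma_k <- -2.85           # hypothetical overall effect
tau_k <- 3.071             # hypothetical precision across sub-populations
sd_k <- 1 / sqrt(tau_k)    # rnorm() expects a standard deviation

# Strategy 1: draw deviations delta_kj ~ N(0, tau_k), then add gamma_k
set.seed(2)
delta <- rnorm(J, mean = 0, sd = sd_k)
beta_uncentered <- gamma_k + delta

# Strategy 2 (hierarchical centering): draw beta_kj ~ N(gamma_k, tau_k) directly
set.seed(2)
beta_centered <- rnorm(J, mean = gamma_k, sd = sd_k)

print(round(cbind(beta_uncentered, beta_centered), 4))
```

With the same random seed both strategies produce the same coefficients, which shows that the choice affects the behaviour of the MCMC algorithm rather than the model itself.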


Example from last Module


> Dependent variable:


Percentage of seats won by the Democratic Party in the
House of Representatives in state i in election t.


> Independent variable:


Level of ideological conflict within state i's Democratic Party
delegation to the House in period t - 1.


> Control variables include dummy variables for the various
states measuring their preference for the Democratic Party
and for each election.


Hierarchical Binomial Linear Regression
Model using the logit link


> The model takes the following form:


$$y_i \sim \text{Bernoulli}(\theta_i)$$


> With latent variables $\phi(\theta_i)$, $\phi(\cdot)$ being the logit link function:


$$\phi(\theta_i) = X_i \beta + W_i b_i + \epsilon_i$$


> where each group $i$ has $k_i$ observations.


> the random effects:


$$b_i \sim N_q(0, V_b)$$


> the over-dispersion terms:


$$\epsilon_i \sim N(0, \sigma^2 I_{k_i})$$


> We assume standard, conjugate priors:


$$\beta \sim N_p(\mu_\beta, V_\beta)$$


$$\sigma^2 \sim \text{IGamma}(\nu, 1/\delta)$$


$$V_b \sim \text{IWishart}(r, rR)$$
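To make the structure of this model concrete, the following sketch simulates data from it in base R. It is an illustration under stated assumptions, not part of the module's analysis: the number of groups, the group size, and all parameter values (beta, Vb, sigma2) are hypothetical, and the same covariates are used for the fixed and the random effects.

```r
set.seed(42)
n_group <- 20                     # hypothetical number of groups
k <- 10                           # hypothetical observations per group
n <- n_group * k
group <- rep(seq_len(n_group), each = k)

X <- cbind(1, rnorm(n), rnorm(n)) # fixed-effect design (intercept, X1, X2)
W <- X                            # random effects on the same covariates

beta <- c(0.3, 0.2, 0.1)          # hypothetical fixed effects
Vb <- diag(c(0.5, 0.05, 0.05))    # hypothetical random-effect covariance
sigma2 <- 0.01                    # hypothetical over-dispersion variance

# Random effects b_i ~ N_q(0, Vb), one row per group
b <- matrix(rnorm(n_group * 3), n_group, 3) %*% chol(Vb)

# Over-dispersion terms and latent linear predictor on the logit scale
eps <- rnorm(n, mean = 0, sd = sqrt(sigma2))
eta <- as.vector(X %*% beta) + rowSums(W * b[group, ]) + eps

theta <- plogis(eta)              # inverse logit link
y <- rbinom(n, size = 1, prob = theta)

print(table(y))
```

Data generated this way have the layout (a response, covariates, and a grouping variable) that a hierarchical logit model is fitted to.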


Hierarchical Binomial Linear Regression
Model using the logit link


> It is difficult to have default parameters for the priors on the
precision matrix for the random effects.


> When fitting one of these models, it is of utmost importance
to choose a prior that reflects your prior beliefs about the
random effects.


> Using the dwish and rwish functions might be useful in
choosing these values.

Implementation in R


> In the MCMCpack R package, the function MCMChlogit implements the
Hierarchical Binomial Linear Regression Model with the logit
link function.


    #== Call to MCMChlogit
    model <- MCMChlogit(fixed=Y~X1+X2, random=~X1+X2, group="species",
                        data=Data, burnin=1000, mcmc=1000, thin=1, verbose=1,
                        seed=NA, beta.start=0, sigma2.start=1,
                        Vb.start=1, mubeta=0, Vbeta=1.0E6,
                        r=3, R=diag(c(1,0.1,0.1)), nu=0.001, delta=0.001, FixOD=1)


Implementation in R


[Figure: scatter plot with Data$theta on the horizontal axis (ticks from 0.0 to 0.8); the vertical axis label is not recoverable.]


Hierarchical Poisson Linear Regression
Model using the log link function


> The model takes the following form:


$$y_i \sim \text{Poisson}(\lambda_i)$$


> With latent variables $\phi(\lambda_i)$, $\phi(\cdot)$ being the log link function:


$$\phi(\lambda_i) = X_i \beta + W_i b_i + \epsilon_i$$


> where each group $i$ has $k_i$ observations.


> the random effects:


$$b_i \sim N_q(0, V_b)$$


> the over-dispersion terms:


$$\epsilon_i \sim N(0, \sigma^2 I_{k_i})$$


> We assume standard, conjugate priors:


$$\beta \sim N_p(\mu_\beta, V_\beta)$$


$$\sigma^2 \sim \text{IGamma}(\nu, 1/\delta)$$


$$V_b \sim \text{IWishart}(r, rR)$$


Implementation in R


> In the MCMCpack R package, the function MCMChpoisson implements
the Hierarchical Poisson Linear Regression Model with the log
link function.


    #== Call to MCMChpoisson
    model <- MCMChpoisson(fixed=Y~X1+X2, random=~X1+X2, group="species",
                          data=Data, burnin=500, mcmc=1000, thin=1, verbose=1,
                          seed=NA, beta.start=0, sigma2.start=1,
                          Vb.start=1, mubeta=0, Vbeta=1.0E6,
                          r=8, R=diag(c(0.1,0.1,0.1)), nu=0.001, delta=0.001, FixOD=1)

Implementation in R


[Figure: scatter plot with Data$lambda on the horizontal axis (ticks from 0.5 to 2.0); the vertical axis label is not recoverable.]


The $\lambda$'s are overestimated by the models.




Subject: Statistics 


Paper: Statistical Inference 


Module: Bayesian Generalized Linear Models 


Development Team


Principal investigator: Dr. Bhaswati Ganguli, Professor, 
Department of Statistics, University of Calcutta 

Paper co-ordinator: Dr. Dipak K Dey, Associate Dean and BOT 
Distinguished Professor, Department of Statistics, 
University of Connecticut 

Content writer: Dr. Sourish Das, Assistant Professor, Chennai 
Mathematical Institute 


Content reviewer: Department of Statistics, University of Calcutta 


Outline


1. Generalized Linear Model (GLM)


2. Bayesian setup of GLM


3. R-implementation


Review of GLMs


> In general, statistical models contain both systematic and
random components.


> For the standard linear model, we assume that Y (the dependent
variable) is a vector of random variables whose components are
independently distributed with mean $\mu$.


> $\mu$ represents the systematic component of the model (the
expected value of Y) and is assumed to be a linear function of
explanatory variables X and parameters b.


> The random part of the model (the unexplainable error terms)
is assumed to be independent with constant error variance.

Review of GLM


> A GLM has three important components, illustrated here for the
normal linear model:


1. random component: each observation or component of Y
has an independent normal distribution with $E(Y) = \mu$ and
constant variance $\sigma^2$.


2. systematic component: covariates X produce a linear
predictor $\eta = X\beta$.


3. link function: the link between the random and the
systematic components, $\mu = \eta$.


> For the normal linear model, the link states that the linear
predictor $\eta = X\beta$ is identical to the expected value of the
random component.


> However, more generally, $\eta_i = g(\mu_i)$, where $g(\cdot)$ is called the
link function.


GLM


> Generalized linear models extend the above setup to the case
where:


1. the random component follows a distribution other than the
normal


2. the link function is a function other than that given above.


> Common distributions other than the normal for the random
component include:


1. Poisson


2. Bernoulli / Binomial


3. Weibull (for modeling duration)


4. Multinomial


GLM


> The link function relates the linear predictor $\eta$ to the
expected value $\mu$ of an observation y.


> In classical linear models, where the dependent variable is
normal, we use the identity link. Since the expected value of a
normal variable can take any real value, as can the linear
predictor, the identity link makes sense.


> For a Poisson (count) model, $\mu > 0$, so the identity link is less
attractive because the linear predictor $\eta$ can be negative while
$\mu$ cannot.


> So, for Poisson models we typically use the log link,
$\eta = \log(\mu)$, with inverse $\mu = \exp(\eta)$.


GLM


> The inverse of the link function maps the linear predictor, which
can take any value along the real line, to a set of plausible
expected values.


> For a Bernoulli model, $0 < \mu < 1$, so the identity link is
unattractive. So, for Bernoulli models, we typically use the
logit link, $\eta = \log(\mu / (1 - \mu))$.
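The effect of the log and logit links can be checked numerically with the link objects that ship with base R (stats::make.link). The sketch below is illustrative only; the grid of linear predictor values is arbitrary.

```r
# Arbitrary values of the linear predictor on the real line
eta <- c(-5, -1, 0, 1, 5)

log_link <- make.link("log")       # eta = log(mu)
logit_link <- make.link("logit")   # eta = log(mu / (1 - mu))

# The inverse links map eta to plausible expected values
mu_poisson <- log_link$linkinv(eta)       # always positive
mu_bernoulli <- logit_link$linkinv(eta)   # always strictly between 0 and 1

print(round(cbind(eta, mu_poisson, mu_bernoulli), 4))

# Applying the link to the expected values recovers the linear predictor
eta_from_log <- log_link$linkfun(mu_poisson)
eta_from_logit <- logit_link$linkfun(mu_bernoulli)
```

Whatever value the linear predictor takes, the resulting mean respects the constraint of the chosen distribution, which is the reason for preferring these links to the identity link.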


Bayesian Setup of the GLM


> The Bayesian setup for the GLM is a very natural extension of
the framework we have used for regression models.


1. Specify the probability distribution for the dependent variable
in your model.


$$Y_i \sim N(\mu_i, \tau)$$


2. Define the linear predictor for your model.


$$b_1 + b_2 X_i$$


3. Choose the link function that maps from the linear predictor $\eta$
to a set of plausible expected values for $\mu$.


$$\mu_i = b_1 + b_2 X_i$$


4. Choose priors for all of the parameters in your model.


$$b_j \sim N(0, 0.001) \text{ and } \tau \sim \text{Gamma}(0.1, 0.1)$$
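The four steps above can be sketched as a short Gibbs sampler in base R for the normal model with identity link. This is a minimal illustration under stated assumptions rather than the module's implementation: the data are simulated with hypothetical true values, the second argument of each normal prior is read as a precision, and the Gamma prior is taken in its shape and rate form.

```r
set.seed(123)
n <- 200
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)   # simulated data (hypothetical truth)
X <- cbind(1, x)

prior_prec <- 0.001    # b_j ~ N(0, precision 0.001)
a0 <- 0.1; b0 <- 0.1   # tau ~ Gamma(shape 0.1, rate 0.1)

R_iter <- 2000
draws <- matrix(NA_real_, R_iter, 3,
                dimnames = list(NULL, c("b1", "b2", "tau")))
tau <- 1               # starting value for the precision
XtX <- crossprod(X)
Xty <- crossprod(X, y)

for (r in seq_len(R_iter)) {
  # b | tau, y is normal with covariance V and mean m
  V <- solve(tau * XtX + prior_prec * diag(2))
  m <- V %*% (tau * Xty)
  b <- as.vector(m + t(chol(V)) %*% rnorm(2))
  # tau | b, y is Gamma with updated shape and rate
  resid <- y - as.vector(X %*% b)
  tau <- rgamma(1, shape = a0 + n / 2, rate = b0 + sum(resid^2) / 2)
  draws[r, ] <- c(b, tau)
}

post_mean <- colMeans(draws[-(1:200), ])   # discard 200 burn-in draws
print(round(post_mean, 3))
```

With priors this vague the posterior means of b1 and b2 should be close to the least-squares estimates, and the posterior mean of tau should be close to the reciprocal of the residual variance.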


Logistic Regression
Bayesian Approach


> Suppose that $y_i \sim \text{Bernoulli}(p_i)$.


> To ensure that $0 < p_i < 1$, we use the logit transformation, so
$$\text{logit}(p_i) = \beta_0 + \beta_1 X_{1i} + \dots + \beta_k X_{ki}$$


and we assume $\beta_j \sim N(0, 0.001)$ for all $j$.


> Notice that there is no prior distribution for the variance of y,
because the Bernoulli distribution has no separate parameter
for the variance.


> Thus, the joint posterior distribution of the parameters is
given by the following expression:

$$p(b_0, b_1, \dots, b_k \mid y, X) \propto p(b_0, b_1, \dots, b_k) \prod_{i=1}^{n} L(y_i \mid b_0 + b_1 X_{1i} + \dots + b_k X_{ki})$$

$$\propto p(b_0, b_1, \dots, b_k) \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}$$

$$\propto p(b_0, b_1, \dots, b_k) \prod_{i=1}^{n} \left[\text{logit}^{-1}(b_0 + b_1 X_{1i} + \dots + b_k X_{ki})\right]^{y_i} \left[1 - \text{logit}^{-1}(b_0 + b_1 X_{1i} + \dots + b_k X_{ki})\right]^{1 - y_i}$$
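The unnormalised joint posterior for the logistic model can be evaluated directly in base R. The sketch below assumes independent normal priors and reads N(0, 0.001) as mean 0 and precision 0.001; the function name `log_post_logit` and the toy data are hypothetical.

```r
# Log of the unnormalised joint posterior for Bayesian logistic regression.
# Assumption: N(0, 0.001) means mean 0 and precision 0.001.
log_post_logit <- function(b, y, X, prior_prec = 0.001) {
  eta <- drop(X %*% b)                          # linear predictor for each i
  p   <- plogis(eta)                            # inverse logit, so 0 < p < 1
  loglik   <- sum(dbinom(y, size = 1, prob = p, log = TRUE))
  logprior <- sum(dnorm(b, 0, 1 / sqrt(prior_prec), log = TRUE))
  loglik + logprior
}

# Toy design matrix (intercept plus one covariate) and responses.
X <- cbind(1, c(-2, -1, 0, 1, 2))
y <- c(0, 0, 1, 1, 1)

lp0 <- log_post_logit(c(0, 0), y, X)     # every p_i equals 0.5 at b = 0
lp1 <- log_post_logit(c(0.5, 1), y, X)   # coefficients that fit the data better
```

An MCMC routine such as a random-walk Metropolis sampler only needs this function up to an additive constant, which is why the normalising constant of the posterior is never computed.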


> Choice of prior on regression coefficients:
1. default improper uniform prior
2. independent normal prior
3. multivariate normal prior
4. Zellner's g-prior
5. independent Cauchy prior
6. Lasso or double exponential prior


Application: birthwt dataset

> library(MCMCpack)
> ## default improper uniform prior
> data(birthwt)
> posterior <- MCMClogit(low ~ age + as.factor(race) + smoke,
+                        data = birthwt)
> round(summary(posterior)$quantiles, digits = 3)

                   2.5%    25%    50%    75%  97.5%
(Intercept)      -2.724 -1.601 -1.007 -0.319  0.872
age              -0.110 -0.061 -0.037 -0.015  0.026
as.factor(race)2  0.070  0.698  1.039  1.383  2.017
as.factor(race)3  0.305  0.793  1.061  1.364  1.950
smoke             0.386  0.854  1.141  1.395  1.888


> library(MCMCpack)
> ## multivariate normal prior
> ## (the prior arguments are cut off on the slide; b0 = 0 and B0 = 0.001
> ##  are assumed here from the N(0, 0.001) prior stated earlier)
> data(birthwt)
> posterior <- MCMClogit(low ~ age + as.factor(race) + smoke,
+                        b0 = 0, B0 = 0.001, data = birthwt)
> round(summary(posterior)$quantiles, digits = 3)

                   2.5%    25%    50%    75%  97.5%
(Intercept)      -2.799 -1.691 -1.064 -0.409  0.790
age              -0.107 -0.059 -0.035 -0.013  0.030
as.factor(race)2  0.036  0.699  1.023  1.372  2.017
as.factor(race)3  0.291  0.802  1.085  1.369  1.920
smoke             0.380  0.874  1.132  1.400  1.924


Poisson Regression

> Suppose that $y_i \sim \text{Poisson}(\lambda_i)$.

> To ensure that $\lambda_i > 0$, we use the log link function,

$$\log(\lambda_i) = \beta_0 + \beta_1 X_{1i} + \dots + \beta_k X_{ki}$$

> $\beta_j \sim N(0, 0.001)$ for all $j$.


MCMC Implementation for GLM
Poisson Regression

> library(MCMCpack)
> counts <- c(18,17,15,20,10,20,25,13,12)
> outcome <- gl(3,1,9)
> treatment <- gl(3,3)
> ## any further arguments to this call are cut off on the slide
> posterior <- MCMCpoisson(counts ~ outcome + treatment)
> round(summary(posterior)$quantiles, digits = 3)

               2.5%    25%    50%    75%  97.5%
(Intercept)   2.662  2.906  3.030  3.154  3.355
outcome2     -0.828 -0.578 -0.457 -0.319 -0.059
outcome3     -0.676 -0.412 -0.286 -0.157  0.087
treatment2   -0.388 -0.138  0.000  0.143  0.394
treatment3   -0.400 -0.146 -0.003  0.130  0.388
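As a rough sanity check on the MCMC output, the same Poisson model can be fitted by maximum likelihood with base R's `glm`. This is a sketch, not part of the original slides: with a flat or very diffuse prior, the posterior medians would be expected to sit close to the maximum likelihood estimates, although the two are not identical quantities.

```r
# Maximum likelihood fit of the same Poisson regression, for comparison
# with the posterior medians (50% column) reported by MCMCpoisson.
counts    <- c(18, 17, 15, 20, 10, 20, 25, 13, 12)
outcome   <- gl(3, 1, 9)   # factor with 3 levels, cycling 1, 2, 3
treatment <- gl(3, 3)      # factor with 3 levels, in blocks of 3

mle <- glm(counts ~ outcome + treatment, family = poisson())
mle_coef <- round(coef(mle), 3)
print(mle_coef)
```

Large gaps between `mle_coef` and the posterior medians would suggest either an informative prior or a chain that has not yet converged.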


Multinomial Logistic Regression

> The model takes the following form:

$$y_i \sim \text{Multinomial}(\pi_i)$$

where

$$\pi_{ij} = \frac{\exp(x_{ij}'\beta)}{\sum_{k=1}^{p} \exp(x_{ik}'\beta)}$$

We assume a multivariate Normal prior on $\beta$:

$$\beta \sim N(b_0, B_0^{-1})$$
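The choice probabilities in the multinomial logit model can be computed with a few lines of base R. The sketch below handles a single observation i, with one row of covariates per alternative; the function name `mnl_probs`, the matrix `X` and the vector `beta` are made-up illustrative values.

```r
# pi_ij = exp(x_ij' beta) / sum_k exp(x_ik' beta) for one observation i.
# Each row of X is the covariate vector x_ij of one alternative j.
mnl_probs <- function(X, beta) {
  eta <- drop(X %*% beta)
  w <- exp(eta - max(eta))   # subtracting the max avoids numerical overflow
  w / sum(w)
}

X    <- rbind(c(1, 0.5), c(0, 1.0), c(0, 0.0))  # three alternatives
beta <- c(0.2, -1)
p    <- mnl_probs(X, beta)                      # probabilities sum to one
```

Subtracting `max(eta)` leaves the ratios unchanged, so the result is the same softmax as in the formula above.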


MCMC for Multinomial Logit Regression

> data(Nethvote)
> ## `urban` and `B0=0` are cut off on the slide; both are completed
> ## here as assumptions
> post <- MCMCmnl(vote ~
+           relig + class + income + educ + age + urban,
+           baseline="D66", mcmc.method="IndMH", B0=0,
+           verbose=0, mcmc=1000, thin=1, tune=0.5,
+           data=Nethvote)

Calculating MLEs and large sample var-cov matrix.
This may take a moment...
Inverting Hessian to get large sample var-cov matrix.

> head(round(summary(post)$quantiles, digits = 3))

                   2.5%    25%    50%    75%  97.5%
(Intercept).CDA  -0.773 -0.320 -0.075  0.171  0.626
(Intercept).PvdA  1.970  2.312  2.475  2.677  3.028
(Intercept).VVD  -1.181 -0.744 -0.507 -0.220  0.129
relig.CDA         1.947  2.331  2.430  2.540  2.916
relig.PvdA       -0.118  0.089  0.220  0.364  0.659


A Gateway to all Postgraduate Courses An MHRD project under its National Mission on Education through ICT (NME-ICT) 


Subject: Statistics 


Paper: Statistical Inference 
Module: Missing Data Models 


Development Team


Principal investigator: Dr. Bhaswati Ganguli, Professor, 
Department of Statistics, University of Calcutta 

Paper co-ordinator: Dr. Dipak K Dey, Associate Dean and BOT 
Distinguished Professor, Department of Statistics, 
University of Connecticut 

Content writer: Dr. Sourish Das, Assistant Professor, Chennai 
Mathematical Institute 


Content reviewer: Department of Statistics, University of Calcutta 


Outline

1. Introduction to missing data models
2. Types of missing data
3. Corrections for missing data in theory
4. Missing data in practice


Introduction to Missing Data

> It is well known that, on average, about half of the respondents
to surveys in political science research do not answer one or
more of the questions used in the analysis.

> Almost all analysts "contaminate" their data by filling in
educated guesses for some of these questions (e.g., for party
identification questions, don't know = independent).

> Even if these guesses are correct on average, filling in missing
cells of our data matrix in this way biases our regression
coefficients' standard errors downwards.

> When educated guesses are not possible, the standard
"remedy" is listwise deletion of missing data, eliminating entire
observations in a wholesale manner.

> Valuable information is lost, and severe selection bias effects
are possible.


Some notation

> Let D denote the data matrix, where D includes independent
and dependent variables: D = {X, y}.

> We assume that some elements of the data matrix are missing.

> Let M denote the missingness indicator matrix with the same
dimensions as D. Each element of M is a one or zero that
indicates whether or not an element of D is missing.

> $M_{ij} = 0$ indicates that the $i$th observation for the $j$th variable is
missing but that the data could be observed.

> $M_{ij} = 1$ means that piece of data is present.
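The notation above maps directly onto base R, where missing cells are coded as `NA`. The small data frame below is made up purely for illustration.

```r
# Toy data matrix D = {X, y} with some cells missing (values are made up).
D <- data.frame(x1 = c(1.2, NA, 3.1, 4.0),
                x2 = c(NA, 0.5, 0.7, 0.9),
                y  = c(10, 12, NA, 15))

# M has the same dimensions as D: 1 = observed, 0 = missing.
M <- ifelse(is.na(D), 0L, 1L)

n_missing     <- sum(M == 0)              # number of missing cells
complete_rows <- rowSums(M) == ncol(D)    # rows that survive listwise deletion
```

Here only one of the four rows is fully observed, which shows how quickly listwise deletion discards data even when each variable is missing just once.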


Comments

> Comment: it is possible that data cannot be observed.
Sometimes a "don't know" really means that the respondent
has no basis on which to provide an answer.

> Finally, let $D_{obs}$ and $D_{mis}$ denote the observed and missing parts
of D: $D = \{D_{obs}, D_{mis}\}$.


Three Types of Missingness

> Missing Completely at Random: if the data are missing
completely at random, then missing values cannot be predicted
any better with the information in D, observed or not.

> Formally, M is independent of D, so $P(M \mid D) = P(M)$.

> A process is missing completely at random if, say, an
individual decides whether to answer survey questions on the
basis of coin flips.

> If independents are more likely to decline to answer a vote
preference or party id question, then the data are not missing
completely at random.

> In the unlikely event that the process is missing completely at
random, inferences based on listwise deletion are
unbiased, but inefficient because we have lost some cases.


> Missing at Random: if the data are missing at random, then
the probability that a cell is missing may depend on $D_{obs}$, but
after controlling for $D_{obs}$ that probability must be independent
of $D_{mis}$.

> In other words, the process that determines whether or not a
cell is missing should not depend on the values in the cell.

> Formally, M is independent of $D_{mis}$ given $D_{obs}$: $P(M \mid D) = P(M \mid D_{obs})$.

> For example, if Democratic identifiers are more likely to refuse
to answer the vote choice question, then the process is
missing at random so long as party id is a question to which
at least some people respond.

> If data are missing at random, then inferences based on listwise
deletion will be biased and inefficient.


> Non-Ignorable: if the probability that a cell is missing depends
on the unobserved value of the missing response, then the
process is non-ignorable.

> Formally, $P(M \mid D)$ cannot be simplified.

> A standard example is individuals' responses to income
questions, where high-income people are more likely to refuse
to answer survey questions about income and other variables
in the data set cannot predict which respondents have high
income.

> If your missing data are non-ignorable, then inferences based on
listwise deletion will be biased and inefficient (and the multiple
imputation algorithms that we will talk about won't be of
much aid).
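The three missingness mechanisms can be compared in a small base R simulation. Everything below is a hypothetical setup chosen for illustration: `x` is always observed, `y` has true mean 0, and cells of `y` are deleted by a coin flip (MCAR), by a rule that depends only on `x` (MAR), or by a rule that depends on `y` itself (non-ignorable).

```r
set.seed(42)
n <- 100000
x <- rnorm(n)                               # always observed
y <- 0.7 * x + rnorm(n, sd = sqrt(0.51))    # true mean of y is 0

miss_mcar <- runif(n) < 0.5                 # coin flip, ignores D entirely
miss_mar  <- runif(n) < plogis(2 * x)       # depends only on the observed x
miss_ni   <- runif(n) < plogis(2 * y)       # depends on the unobserved y itself

# Listwise deletion: estimate the mean of y from complete cases only.
cc_mean <- c(MCAR = mean(y[!miss_mcar]),
             MAR  = mean(y[!miss_mar]),
             NI   = mean(y[!miss_ni]))

# Under MAR, conditioning on x removes the bias: regress y on x among the
# complete cases, then average the predictions over every observed x.
fit <- lm(y ~ x, subset = !miss_mar)
mar_adjusted <- mean(predict(fit, newdata = data.frame(x = x)))
```

In this setup the complete-case mean stays near zero only under MCAR. Under MAR the bias can be repaired by using the observed `x`, which is the idea that model-based imputation exploits; under the non-ignorable mechanism no function of the observed data repairs it.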


Multiple Imputation of Missing Data

> Multiple imputation involves imputing m values for each
missing item and creating m completed data sets.

> Across each of the m data sets:
if $M_{ij} = 1$ then $D_{ij}$ = observed data
if $M_{ij} = 0$ then $D_{ij}$ = an imputed value

> The imputed values for the data set are based on our guesses
about the value of the missing $D_{ij}$, together with summaries of our
uncertainty regarding the missing values.

> For each imputed data set, you perform whatever statistical
analysis you normally would. Then, you average your results
over the m analyses.


Calculating Quantities of Interest

> To estimate some quantity of interest Q, such as a variable
mean or regression coefficient, the overall point estimate $q^*$ of
Q is the average of the m separate estimates $q_j$. That is,

$$q^* = \frac{1}{m} \sum_{j=1}^{m} q_j$$

> Let $SE(q_j)$ denote the standard error of $q_j$ for data set $j$, and
let $S_q^2 = \sum_{j=1}^{m} (q_j - q^*)^2 / (m - 1)$ denote the sample variance
across the m point estimates.

> Then the variance of the multiple imputation point estimate
is the average of the estimated variances from within
each completed data set plus a weighted version of the sample
variance in the point estimates across the data sets.

> The weight is a function of the number of data sets, so that as
$m \to \infty$ the total is simply the sum of the two
sources of uncertainty.

> That is,

$$SE(q)^2 = \frac{1}{m} \sum_{j=1}^{m} SE(q_j)^2 + S_q^2 \left(1 + \frac{1}{m}\right)$$
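The combining rules above translate into a short base R function. The name `pool_mi` and the input numbers are hypothetical; the function takes the m point estimates and their standard errors and returns the pooled estimate and its standard error.

```r
# Pool m point estimates q_j and their standard errors SE(q_j).
pool_mi <- function(q, se) {
  m <- length(q)
  q_star  <- mean(q)                          # overall point estimate q*
  within  <- mean(se^2)                       # average within-imputation variance
  between <- sum((q - q_star)^2) / (m - 1)    # S_q^2, variance across estimates
  total   <- within + between * (1 + 1 / m)   # SE(q)^2
  c(estimate = q_star, se = sqrt(total))
}

# Made-up estimates from m = 3 imputed data sets.
res <- pool_mi(q = c(1, 2, 3), se = c(1, 1, 1))
```

The pooled standard error is larger than any single within-imputation standard error whenever the estimates disagree across data sets, which is how the extra uncertainty due to missing data enters the final result.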


King et al.'s imputation model

> To implement multiple imputation, we need a statistical
model that we can use to sample missing values.

> To use King's AMELIA program (and our most extensive R
alternative), we assume that the data are missing at random
conditional on the imputation model (where the model is
defined to include the variables that provide information about
the missing data process).

> King's missing data program is based on the assumption that
all of the variables in our model are jointly multivariate normal.
King argues that in most cases the multivariate normal
density is a very good approximation, even if some of our
variables are ordinal.


> Stated more formally, King et al. assume that for each
observation i (i = 1, ..., n), $D_i$ denotes the vector of values for
the p variables, including the dependent and independent
variables.

> The likelihood function for the complete data set is given by:

$$L(\mu, \Sigma \mid D) \propto \prod_{i=1}^{n} N(D_i \mid \mu, \Sigma)$$

where $\mu$ is the vector of p means and $\Sigma$ is a p x p
variance-covariance matrix that provides information about
how values of the independent variables depend on one
another.

> If we assume that the data are missing at random, then the
likelihood function for the observed data is given by:

$$L(\mu, \Sigma \mid D_{obs}) \propto \prod_{i=1}^{n} N(D_{i,obs} \mid \mu_{i,obs}, \Sigma_{i,obs})$$

> King et al. note that this will be a tough likelihood to work
with because each observation is likely to have a different
combination of missing values.
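The observed-data likelihood can be evaluated row by row, using for each observation only the sub-vector of means and the sub-matrix of the covariance that correspond to its observed cells. The base R sketch below is an illustration under that reading; the function name `obs_loglik` and the example matrix are hypothetical, and missing cells are coded as `NA`.

```r
# Observed-data log-likelihood under a multivariate normal model.
# D is a numeric matrix with NA for missing cells.
obs_loglik <- function(D, mu, Sigma) {
  total <- 0
  for (i in seq_len(nrow(D))) {
    o <- !is.na(D[i, ])                    # which cells of row i are observed
    if (!any(o)) next                      # a fully missing row adds nothing
    d <- D[i, o] - mu[o]                   # deviation from mu_{i,obs}
    S <- Sigma[o, o, drop = FALSE]         # Sigma_{i,obs}
    log_det <- as.numeric(determinant(S, logarithm = TRUE)$modulus)
    quad <- drop(t(d) %*% solve(S, d))
    total <- total - 0.5 * (sum(o) * log(2 * pi) + log_det + quad)
  }
  total
}

# Made-up example: one complete row and two partially observed rows.
D  <- rbind(c(1.0, 2.0), c(0.5, NA), c(NA, -1.0))
ll <- obs_loglik(D, mu = c(0, 0), Sigma = diag(2))
```

The loop makes the difficulty visible: every distinct pattern of missingness needs its own sub-matrix of $\Sigma$, so the likelihood does not factor into one simple expression.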


> The multivariate normal model implies that each missing value
is imputed linearly. That is, we can simulate a missing value
in the way that we would usually simulate from a regression.

> For example, let $D_{ij}$ denote a simulated value of observation i
and variable j, and let $D_{i,-j}$ denote the vector of values of all
[observed] variables in row i except j.

> Then the posterior distribution of the coefficient $\beta_{-j}$ from a
regression of $D_j$ on $D_{-j}$ (which can be calculated from $\mu$ and
$\Sigma$) can be used such that: $D_{ij} = D_{i,-j}\beta_{-j} + \epsilon_i$.

> An alternative way to think about this is to imagine a
multivariate normal distribution. An imputed draw from the
multivariate normal distribution is a draw from the slice of the
distribution for $D_{mis}$ corresponding to the value of $D_{obs}$.
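The "slice" of the multivariate normal distribution is the conditional distribution of the missing variable given the observed ones, and its mean is exactly a linear regression on the observed values. The base R sketch below is a simplification: it treats $\mu$ and $\Sigma$ as known rather than drawing them from their posterior, and the function name `draw_missing` and the numbers are hypothetical.

```r
# Draw one imputed value for variable j in a row whose other variables
# are observed, conditional on known mu and Sigma (a simplification).
draw_missing <- function(d_obs, j, mu, Sigma) {
  S_oo <- Sigma[-j, -j, drop = FALSE]       # covariance of observed variables
  S_jo <- Sigma[j, -j, drop = FALSE]        # covariance of j with the others
  beta <- solve(S_oo, t(S_jo))              # regression slopes of D_j on D_-j
  cond_mean <- mu[j] + drop((d_obs - mu[-j]) %*% beta)
  cond_var  <- Sigma[j, j] - drop(S_jo %*% beta)
  list(mean = cond_mean, var = cond_var,
       draw = rnorm(1, cond_mean, sqrt(cond_var)))
}

# Two variables with correlation 0.8; variable 2 is missing, variable 1 is 1.5.
mu    <- c(0, 0)
Sigma <- matrix(c(1, 0.8, 0.8, 1), 2, 2)
out   <- draw_missing(d_obs = 1.5, j = 2, mu = mu, Sigma = Sigma)
```

A full multiple imputation procedure would repeat this with different plausible values of $\mu$ and $\Sigma$, so that the imputations also reflect uncertainty about the model parameters.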


Missing Data Algorithms

> Gary King at Harvard has a very easy-to-use program called
AMELIA (now available as an R package) that creates data sets
with imputed missing values.

> The program is designed to read in a raw data set with
missing values and output m new data sets with imputed
missing values.

> You then run your analyses on the m imputed data sets, combine
the coefficients and standard errors as discussed
above, and that is it.

> Rather than using a small number of multiply imputed data
sets, one might incorporate a multiple imputation algorithm
into a Gibbs sampler, taking an independent draw from the
multiple imputation data set with each iteration of the
program.

> The advantage of this is that the posterior standard errors for
regression coefficients already summarize all of the uncertainty
about the process for the missing data and the uncertainty
about the coefficients themselves.

> The disadvantage is that the imputations will be much slower
and you will have to check for convergence.


[Figure 1: A schematic of our approach to multiple imputation with the EMB algorithm: incomplete data -> bootstrap -> bootstrapped data -> EM -> imputed datasets -> analysis -> separate results -> combination -> final results.]

> Figure excerpted from Honaker, King and Blackwell (2014)


Amelia : The R package for 
Missing Data Imputation 


> “AMELIA II: A Program for Missing Data", developed by 
Honaker, King and Blackwell is available as R-package. 


Amelia : The R package for Ë athshala 
Missing Data Imputation eee d 


>» “AMELIA II: A Program for Missing Data", developed by 
Honaker, King and Blackwell is available as R-package. 


> Missing data imputation can easily be handled in Bayesian 
methods. 


athshala 


Amelia : The R package for aR 
WTOSITeIT ay" 


Missing Data Imputation 


» “AMELIA II: A Program for Missing Data", developed by 
Honaker, King and Blackwell is available as R-package. 


> Missing data imputation can easily be handled in Bayesian 
methods. 


> Amelia provides a Bayesian setup to handle missing data 
imputation 


MHRD 


| Govt. of India 


Amelia : The R package for AQathshata t 
Missing Data Imputation Ra 


> Observation-level priors: 


Govt. of India 


Amelia : The R package for Ë athshala 
Missing Data Imputation al 


» Observation-level priors: 


» |f one has an additional prior information about missing data 
values based on previous research, academic consensus, or 
personal experience. Amelia can incorporate this information 
to produce vastly improved imputations. 


'"MHRD 


ovt. of India 


Amelia : The R package for ip atnshata 
Missing Data Imputation IIS 


> Observation-level priors: 


> If one has an additional prior information about missing data 
values based on previous research, academic consensus, or 
personal experience. Amelia can incorporate this information 
to produce vastly improved imputations. 


> The Amelia algorithm allows to include informative Bayesian 
priors about individual missing data cells. 


> The incorporation of priors follows basic Bayesian analysis, where the imputation turns out to be a weighted average of the model-based imputation and the prior mean, with weights that are functions of the relative strength of the data and the prior.

> If the model predicts well, the imputation will down-weight the prior, and vice versa (Honaker and King, 2010).
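The weighted-average idea can be illustrated with a small sketch. It assumes a normal model-based imputation and a normal prior for a single missing cell, so the weights are precisions (inverse variances). The function name and the numbers are hypothetical, and this is not Amelia's internal code.

```python
def combine_with_prior(model_mean, model_var, prior_mean, prior_var):
    """Precision-weighted average of a model-based imputation and a prior mean.

    Assumes both sources are normal; this is an illustration only.
    """
    w_model = 1.0 / model_var   # strength (precision) of the model-based imputation
    w_prior = 1.0 / prior_var   # strength (precision) of the prior
    post_var = 1.0 / (w_model + w_prior)
    post_mean = post_var * (w_model * model_mean + w_prior * prior_mean)
    return post_mean, post_var

# The model predicts well (small variance): the prior is down-weighted.
sharp_model = combine_with_prior(20.0, 1.0, 30.0, 25.0)
# The model predicts poorly (large variance): the prior dominates.
vague_model = combine_with_prior(20.0, 25.0, 30.0, 1.0)
print(sharp_model, vague_model)
```

In the first call the combined value stays close to the model-based imputation of 20; in the second it moves close to the prior mean of 30.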


Application

> We consider the mtcars data available in R.

> We consider a subset of the data with three variables in our analysis: (i) mpg, (ii) hp and (iii) disp.

> We randomly delete certain cells of the data.

> We impute the missing values using Amelia and then cross-check against the actual cell values.


Output printed while Amelia runs:

-- Imputation 1 --
  1  2  3  4  5  6  7  8  9 10 11 12 13 14
-- Imputation 2 --
  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
-- Imputation 3 --
  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18


[Figure: Observed and Imputed values of mpg. Legend: Mean Imputations, Observed Values.]

[Figure: Observed and Imputed values of disp. Legend: Mean Imputations, Observed Values.]

[Figure: Observed and Imputed values of hp. y-axis: Relative Density. Legend: Mean Imputations, Observed Values.]


[Figure: Observed versus Imputed Values of disp. y-axis: Imputed Values.]

[Figure: Observed versus Imputed Values of mpg. y-axis: Imputed Values.]

[Figure: Observed versus Imputed Values of hp. y-axis: Imputed Values.]


A Gateway to all Postgraduate Courses An MHRD project under its National Mission on Education through ICT (NME-ICT) 


Subject: Statistics 


Paper: Statistical Inference 


Module: Nonparametric Bayesian Analysis: 
Dirichlet Process Models 


Development Team


Principal investigator: Dr. Bhaswati Ganguli, Professor, 
Department of Statistics, University of Calcutta 

Paper co-ordinator: Dr. Dipak K Dey, Associate Dean and BOT 
Distinguished Professor, Department of Statistics, 
University of Connecticut 

Content writer: Dr. Sourish Das, Assistant Professor, Chennai 
Mathematical Institute 


Content reviewer: Department of Statistics, University of Calcutta 


Outline

1. Introduction to Dirichlet Process models
2. Stick Breaking and Chinese Restaurant Process
3. Gibbs Sampler for Dirichlet Process models
4. Application


Introduction

> The Dirichlet process is an infinite-dimensional generalization of the Dirichlet distribution.

> It can be used as a prior on unknown distributions.

> These unknown densities can be used to extend finite component mixture models to infinite component mixture models.


Motivation

> We are given a data set as follows:

[Figure: plot of the data set.]

> Even if we know the data come from a mixture of Gaussians, it is difficult to tell how many component distributions are actually there.


Dirichlet Distribution

> Let $\Theta = (\theta_1, \theta_2, \ldots, \theta_m)$ with

$$\Theta \sim \text{Dirichlet}(\alpha_1, \alpha_2, \ldots, \alpha_m)$$

$$P(\theta_1, \theta_2, \ldots, \theta_m) = \frac{\Gamma\left(\sum_{k=1}^{m} \alpha_k\right)}{\prod_{k=1}^{m} \Gamma(\alpha_k)} \prod_{k=1}^{m} \theta_k^{\alpha_k - 1}$$

> It is a distribution over possible parameter vectors for a multinomial distribution, and is the conjugate prior for the multinomial.

> The Beta distribution is the special case of a Dirichlet for 2 dimensions.


Bayesian Histogram

> $y_i \overset{iid}{\sim} f$, where the density $f$ is unknown.

> The goal is to obtain a Bayes estimate of the density $f$.

> The histogram is often used as a simple form of density estimate.

> Assume we have prespecified knots $\xi = (\xi_0, \xi_1, \ldots, \xi_k)$ to define our histogram estimate, with $\xi_0 < \xi_1 < \ldots < \xi_k$ and $y_i \in [\xi_0, \xi_k]$.

> A probability model for the density that is analogous to the histogram is as follows:

$$f(y) = \sum_{h=1}^{k} 1(\xi_{h-1} < y \le \xi_h) \frac{\pi_h}{\xi_h - \xi_{h-1}}, \quad y \in \mathbb{R},$$

with $\pi = (\pi_1, \ldots, \pi_k)$ an unknown probability vector.


> Bayes specification with a prior distribution for the probabilities.

> Assume a Dirichlet$(a_1, \ldots, a_k)$ prior distribution for $\pi$,

$$p(\pi \mid a) = \frac{\Gamma\left(\sum_{h=1}^{k} a_h\right)}{\prod_{h=1}^{k} \Gamma(a_h)} \prod_{h=1}^{k} \pi_h^{a_h - 1}.$$

> The posterior distribution of $\pi$ is

$$p(\pi \mid y, a) \propto \prod_{h=1}^{k} \pi_h^{a_h - 1} \prod_{i:\, y_i \in (\xi_{h-1}, \xi_h]} \frac{\pi_h}{\xi_h - \xi_{h-1}} \propto \prod_{h=1}^{k} \pi_h^{a_h + n_h - 1},$$

that is, $\pi \mid y, a \sim \text{Dirichlet}(a_1 + n_1, \ldots, a_k + n_k)$, where $n_h = \sum_i 1(\xi_{h-1} < y_i \le \xi_h)$ is the number of observations falling in the $h$-th histogram bin.


> The Bayesian histogram estimator does an adequate job approximating the true density, but the results are sensitive to the number and locations of knots.

> This approach allows prior information to be included and allows easy production of interval estimates, and hence has some practical advantages over classical histogram estimators.

> The Dirichlet prior distribution is perhaps not the best choice due to the lack of smoothing across adjacent bins, but it does have the advantage of conjugacy and simplicity in interpretation of the hyperparameters.
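The conjugate update for the Bayesian histogram can be sketched in a few lines. The sketch assumes NumPy; the function name `bayes_histogram`, the toy data and the flat Dirichlet(1, 1, 1) prior are hypothetical choices made only for illustration.

```python
import numpy as np

def bayes_histogram(y, knots, a):
    """Return Dirichlet posterior parameters a_h + n_h and posterior-mean bin heights."""
    y = np.asarray(y, dtype=float)
    knots = np.asarray(knots, dtype=float)
    a = np.asarray(a, dtype=float)
    k = len(knots) - 1
    # Bin h (0-based) holds y with knots[h] < y <= knots[h + 1];
    # a value equal to the first knot is placed in the first bin.
    idx = np.clip(np.searchsorted(knots, y, side="left") - 1, 0, k - 1)
    counts = np.bincount(idx, minlength=k)      # n_h
    post = a + counts                           # Dirichlet(a_h + n_h)
    pi_mean = post / post.sum()                 # E[pi_h | y]
    heights = pi_mean / np.diff(knots)          # posterior-mean density in each bin
    return post, heights

y = [0.5, 1.0, 1.5, 2.2, 2.7, 2.9]              # toy data (hypothetical)
knots = [0.0, 1.0, 2.0, 3.0]
post, heights = bayes_histogram(y, knots, a=[1.0, 1.0, 1.0])

# Interval estimates follow directly from posterior draws of pi.
rng = np.random.default_rng(0)
draws = rng.dirichlet(post, size=4000) / np.diff(knots)
lower, upper = np.percentile(draws, [2.5, 97.5], axis=0)
print(post, heights, lower, upper)
```

Moving the knots changes the counts and therefore the whole estimate, which is the sensitivity to the number and locations of knots noted above.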


Dirichlet Process

> A Dirichlet Process is a distribution over distributions.

> Let $G$ be Dirichlet Process distributed:

$$G \sim DP(\alpha, G_0)$$

  1. $G_0$ is a base distribution
  2. $\alpha$ is a positive scaling parameter

> $G$ is a random probability measure that has the same support as $G_0$.


> Consider Gaussian $G_0$

[Figure: density of the Gaussian base distribution $G_0$.]

> $G \sim DP(\alpha, G_0)$

[Figure: a draw $G$ from $DP(\alpha, G_0)$.]


> $G_0$ is continuous, so the probability that any two samples from $G_0$ are the same is zero.

> However, $G$ is a discrete distribution, made up of a countably infinite number of point masses.

> Hence, there is a non-zero probability of two samples from $G$ colliding.


Sample from Dirichlet Processes

> $G \sim DP(\alpha, G_0)$

> $X_i \mid G \sim G$ for $i \in \{1, 2, \ldots, n\}$ (iid given $G$)

> Marginalizing out $G$ introduces dependencies between the $X_i$:

$$P(X_1, X_2, \ldots, X_n) = \int P(G) \prod_{i=1}^{n} P(X_i \mid G) \, dG$$


Samples from a Dirichlet Process

> Assume we view these variables in a specific order and are interested in the behavior of $X_n$ given the previous $n - 1$ observations:

$$X_n \mid X_1, X_2, \ldots, X_{n-1} = \begin{cases} X_i & \text{with prob. } \frac{1}{n - 1 + \alpha} \text{ (for each } i < n\text{)} \\ \text{new draw from } G_0 & \text{with prob. } \frac{\alpha}{n - 1 + \alpha} \end{cases}$$

> Let there be $K$ unique values for the variables: $X_k^*$ for $k \in \{1, 2, \ldots, K\}$. Then

$$X_n \mid X_1, X_2, \ldots, X_{n-1} = \begin{cases} X_k^* & \text{with prob. } \frac{\text{num}_{n-1}(X_k^*)}{n - 1 + \alpha} \\ \text{new draw from } G_0 & \text{with prob. } \frac{\alpha}{n - 1 + \alpha} \end{cases}$$

where $\text{num}_{n-1}(X_k^*)$ is the number of the first $n - 1$ observations equal to $X_k^*$.

> The joint distribution is

$$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_1, \ldots, X_{i-1}) = \frac{\alpha^K \prod_{k=1}^{K} (\text{num}(X_k^*) - 1)!}{\alpha(\alpha + 1) \cdots (\alpha + n - 1)} \prod_{k=1}^{K} G_0(X_k^*)$$

> Notice that the above formulation of the joint distribution does not depend on the order in which we consider the variables.

Blackwell-MacQueen Urn Scheme

> $G \sim DP(\alpha, G_0)$

> $X_n \mid G \sim G$

> Assume that $G_0$ is a distribution over colors, and that each $X_n$ represents the color of a single ball placed in the urn.

> Start with an empty urn.

> On step $n$:
  > With probability proportional to $\alpha$, draw $X_n \sim G_0$, and add a ball of that color to the urn.
  > With probability proportional to $n - 1$ (i.e., the number of balls currently in the urn), pick a ball at random from the urn. Record its color as $X_n$, and return the ball into the urn, along with a new one of the same color.

Ref: Blackwell and MacQueen (1973)
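The urn steps above can be simulated directly. The sketch below uses only the Python standard library; the function name `blackwell_macqueen_urn` and the choice of a standard normal $G_0$ (so that "colors" are real numbers) are assumptions made for illustration.

```python
import random

def blackwell_macqueen_urn(n, alpha, base_draw, rng):
    """Simulate X_1, ..., X_n from the urn scheme with G marginalized out."""
    urn = []
    for step in range(1, n + 1):
        # New color with probability alpha / (alpha + step - 1);
        # on step 1 this probability is 1, so the urn is never empty below.
        if rng.random() < alpha / (alpha + step - 1):
            urn.append(base_draw(rng))
        else:
            # Otherwise copy the color of a ball picked uniformly from the urn.
            urn.append(rng.choice(urn))
    return urn

rng = random.Random(0)
draws = blackwell_macqueen_urn(
    200, alpha=2.0, base_draw=lambda r: r.gauss(0.0, 1.0), rng=rng
)
n_unique = len(set(draws))
print(n_unique)
```

Although $G_0$ is continuous, the 200 draws contain repeated values, so the number of distinct values is smaller than the number of draws.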


Chinese Restaurant Process

$$X_n \mid X_1, X_2, \ldots, X_{n-1} = \begin{cases} X_k^* & \text{with prob. } \frac{\text{num}_{n-1}(X_k^*)}{n - 1 + \alpha} \\ \text{new draw from } G_0 & \text{with prob. } \frac{\alpha}{n - 1 + \alpha} \end{cases}$$

> Consider a restaurant with infinitely many tables, where the $X_n$'s represent the patrons of the restaurant.

> From the above conditional probability distribution, we can see that a new customer is more likely to sit at a table if there are already many people sitting there.

> However, with probability proportional to $\alpha$, the customer will sit at a new table.

> This is known as the "clustering effect".


Stick Breaking

> So far, we have discussed properties of a distribution $G$ drawn from a Dirichlet Process.

> In 1994, Sethuraman developed a constructive way of forming $G$, known as "stick breaking".

> "Stick breaking" helped practitioners to implement the Dirichlet Process in practice.


> $V_1, V_2, \ldots, V_i, \ldots \sim \text{Beta}(1, \alpha)$

> $f(V_i = v_i \mid \alpha) = \alpha (1 - v_i)^{\alpha - 1}$

> $X_1^*, X_2^*, \ldots, X_i^*, \ldots \sim G_0$

> $\pi_i(v) = v_i \prod_{j=1}^{i-1} (1 - v_j)$

> $G = \sum_{i=1}^{\infty} \pi_i(v) \, \delta_{X_i^*}$


> Draw $X_1^*$ from $G_0$

> Draw $v_1$ from Beta$(1, \alpha)$

> $\pi_1 = v_1$

> Draw $X_2^*$ from $G_0$

> Draw $v_2$ from Beta$(1, \alpha)$

> $\pi_2 = v_2 (1 - v_1)$
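Continuing these draws for a fixed number of atoms gives a truncated approximation to $G$. The sketch below assumes NumPy and, purely for illustration, a standard normal base distribution $G_0$; the function name `stick_breaking` and the truncation level of 50 atoms are hypothetical choices.

```python
import numpy as np

def stick_breaking(alpha, n_atoms, rng):
    """Draw the first n_atoms weights and atoms of G ~ DP(alpha, G_0)."""
    v = rng.beta(1.0, alpha, size=n_atoms)                # V_i ~ Beta(1, alpha)
    # Length of stick left before break i: prod_{j < i} (1 - v_j)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    weights = v * remaining                               # pi_i = v_i * prod_{j < i} (1 - v_j)
    atoms = rng.normal(0.0, 1.0, size=n_atoms)            # X*_i ~ G_0 (assumed N(0, 1))
    return weights, atoms

rng = np.random.default_rng(1)
weights, atoms = stick_breaking(alpha=2.0, n_atoms=50, rng=rng)
leftover = 1.0 - weights.sum()                            # mass of the atoms not drawn
print(weights[:5], leftover)
```

The leftover mass shrinks geometrically with the number of atoms, which is why a finite truncation is usually adequate in practice.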


> Let $\alpha$ be a positive, real-valued scalar.

> Let $G_0$ be a non-atomic probability distribution over support set $A$.

> If $G \sim DP(\alpha, G_0)$, then for any finite partition of $A$,

$$A_1 \cup A_2 \cup \ldots \cup A_K = A,$$

we have

$$(G(A_1), \ldots, G(A_K)) \sim \text{Dirichlet}(\alpha G_0(A_1), \ldots, \alpha G_0(A_K))$$


DP Mixture Model

> General kernel mixture model:

$$f(y \mid P) = \int K(y \mid \theta) \, dP(\theta)$$

where $K(\cdot \mid \theta)$ is a kernel, with $\theta$ including location and scale parameters.

> A DP prior on $P$ leads to

$$f(y) = \sum_{h=1}^{\infty} \pi_h K(y \mid \theta_h^*)$$

where the probability weights $\pi = (\pi_h)$ are sampled from a DP stick-breaking process with parameter $\alpha$, and $\theta_h^* \sim P_0$ independently for $h = 1, 2, \ldots, \infty$.


> A key question is how to conduct posterior computation under the DP.

> This initially seems problematic in that the mixing measure $P$ is characterized by infinitely many parameters.

> A clever way around this problem is to marginalize out $P$ to obtain an induced prior distribution on the subject-specific parameters $\theta^n = (\theta_1, \ldots, \theta_n)$.


Polya Urn Scheme

> In particular, marginalizing out $P$, we obtain the Polya urn predictive rule,

$$p(\theta_i \mid \theta_1, \ldots, \theta_{i-1}) \sim \left(\frac{\alpha}{\alpha + i - 1}\right) P_0(\theta_i) + \sum_{j=1}^{i-1} \left(\frac{1}{\alpha + i - 1}\right) \delta_{\theta_j}$$

> The Chinese restaurant process metaphor is commonly used in describing the Polya urn scheme.


Blocked Gibbs Sampler

> By marginalizing out the random probability measure $P$, we give up the ability to conduct inferences on $P$.

> By having approaches that avoid marginalization, we open the door to generalizations of DPMs for which marginalization is not possible analytically.

> One approach for avoiding marginalization is to rely on the construction

$$f(y) = \sum_{h=1}^{\infty} \pi_h K(y \mid \theta_h^*)$$

> Because the stick-breaking construction orders the mixture components so that the weights are stochastically decreasing in the index $h$, for a sufficiently high index $N$, we will have that $\sum_{h=N+1}^{\infty} \pi_h$ has a distribution concentrated near zero.


> Using the stick-breaking truncation, the following blocked Gibbs sampler can be used:

  1. Update $S_i \in \{1, 2, \ldots, N\}$ by multinomial sampling with

$$\Pr(S_i = h \mid -) = \frac{\pi_h K(y_i \mid \theta_h^*)}{\sum_{l=1}^{N} \pi_l K(y_i \mid \theta_l^*)}$$

     where $S_i = h$ if $\theta_i = \theta_h^*$ denotes that subject $i$ is allocated to cluster $h$.

  2. Update the stick-breaking weight $V_h$, $h = 1, 2, \ldots, N - 1$, from Beta$\left(1 + n_h, \alpha + \sum_{h'=h+1}^{N} n_{h'}\right)$.

  3. Update $\theta_h^*$, $h = 1, 2, \ldots, N$, exactly as in the finite mixture model, with the parameters for unoccupied clusters with $n_h = 0$ sampled from the prior $P_0$.

> This algorithm involves simple sampling steps and is straightforward to implement.
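The three steps can be sketched for one simple case. The sketch assumes NumPy, a Gaussian kernel $K(y \mid \theta) = N(y; \theta, \sigma^2)$ with known $\sigma$, and a normal base measure $P_0 = N(m_0, s_0^2)$; these modelling choices, the function name `blocked_gibbs` and the synthetic bimodal data are assumptions for illustration, not the setup used in the slides.

```python
import numpy as np

def blocked_gibbs(y, alpha=1.0, N=20, sigma=0.5, m0=3.0, s0=2.0, n_iter=200, seed=0):
    """Blocked Gibbs sampler for a truncated DP mixture of N(theta, sigma^2) kernels."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    theta = rng.normal(m0, s0, size=N)          # start theta*_h from the prior P_0
    pi = np.full(N, 1.0 / N)
    for _ in range(n_iter):
        # Step 1: Pr(S_i = h | -) proportional to pi_h * K(y_i | theta*_h)
        logp = np.log(np.maximum(pi, 1e-300)) - 0.5 * ((y[:, None] - theta) / sigma) ** 2
        p = np.exp(logp - logp.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        u = rng.random(y.size)
        S = np.minimum((p.cumsum(axis=1) < u[:, None]).sum(axis=1), N - 1)
        counts = np.bincount(S, minlength=N)    # n_h
        # Step 2: V_h ~ Beta(1 + n_h, alpha + sum_{h' > h} n_h'), with V_N = 1
        tail = counts[::-1].cumsum()[::-1] - counts
        V = np.append(rng.beta(1.0 + counts[:-1], alpha + tail[:-1]), 1.0)
        pi = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
        # Step 3: conjugate normal update; empty clusters fall back to the prior P_0
        sums = np.bincount(S, weights=y, minlength=N)
        post_var = 1.0 / (1.0 / s0**2 + counts / sigma**2)
        post_mean = post_var * (m0 / s0**2 + sums / sigma**2)
        theta = rng.normal(post_mean, np.sqrt(post_var))
    return pi, theta, S

# Synthetic bimodal sample (not a real data set).
data_rng = np.random.default_rng(42)
y = np.concatenate([data_rng.normal(2.0, 0.3, 60), data_rng.normal(4.3, 0.4, 100)])
pi, theta, S = blocked_gibbs(y)
print(np.round(pi[np.unique(S)], 3), np.round(theta[np.unique(S)], 2))
```

Only the final state of the chain is returned here; a fuller implementation would store draws after burn-in and average the implied density over them.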


Application

> Consider the faithful dataset.

[Figure: Histogram of y. y-axis: Frequency.]

> Clearly the data are bimodal.

[Figure: Dirichlet Process Fit. x-axis: eruptions; y-axis: Density.]


A Gateway to all Postgraduate Courses An MHRD project under its National Mission on Education through ICT (NME-ICT) 


Subject: Statistics 


Paper: Statistical Inference 


Module: Gaussian Process Prior 
for Non-Parametric Regression 


Development Team


Principal investigator: Dr. Bhaswati Ganguli, Professor, 
Department of Statistics, University of Calcutta 

Paper co-ordinator: Dr. Dipak K Dey, Associate Dean and BOT 
Distinguished Professor, Department of Statistics, 
University of Connecticut 

Content writer: Dr. Sourish Das, Assistant Professor, Chennai 
Mathematical Institute 


Content reviewer: Department of Statistics, University of Calcutta 


Outline

1. Introduction to Gaussian Process
2. Function-Space View
3. Application


Introduction

> The Gaussian process is an infinite-dimensional generalization of the Gaussian distribution.

> It can be used as a prior over an unknown function.

> In this module, we describe Gaussian process methods for regression problems. For more detail see Rasmussen and Williams (2006).


> There are several ways to interpret Gaussian process (GP) regression models.

> One can think of a Gaussian process as defining a distribution over functions, with inference taking place directly in the space of functions: the function-space view.

> An alternative, equivalent view is the weight-space view.


Weight-Space View

> We have a training set $\mathcal{D} = \{(x_i, y_i) \mid i = 1, 2, \ldots, n\}$, where $x$ denotes an input vector (covariates) of dimension $D$.

> $y$ denotes a scalar output or target (dependent variable).

> The column vector inputs for all $n$ cases are aggregated in the $D \times n$ design matrix $X$, and the targets are collected in the vector $y$,

> so we can write $\mathcal{D} = \{X, y\}$.


> Standard linear regression model with Gaussian noise:

$$f(x) = x^\top w, \qquad y = f(x) + \varepsilon,$$

where $x$ is the input vector, $w$ is a vector of weights (parameters) of the linear model, $f$ is the function value and $y$ is the observed target value, with

$$\varepsilon \sim N(0, \sigma_n^2)$$

> This noise assumption together with the model directly gives rise to the likelihood.


> The likelihood is the probability density of the observations given the parameters, which factors over the cases in the training set (because of the independence assumption) to give

$$p(y \mid X, w) = \prod_{i=1}^{n} p(y_i \mid x_i, w) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left(-\frac{(y_i - x_i^\top w)^2}{2\sigma_n^2}\right).$$


> In the Bayesian setup we should specify a prior over the parameters before we look at the observations.

> We put a zero-mean Gaussian prior with covariance matrix $\Sigma_p$ on the weights:

$$w \sim N(0, \Sigma_p).$$

> Inference in the Bayesian linear model is based on the posterior distribution over the weights, computed by Bayes' rule:

$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{marginal likelihood}}, \qquad p(w \mid y, X) = \frac{p(y \mid X, w)\, p(w)}{p(y \mid X)},$$

where $p(y \mid X) = \int p(y \mid X, w)\, p(w)\, dw$.


> The posterior distribution is Gaussian with mean $\bar{w}$ and covariance matrix $A^{-1}$:

$$p(w \mid X, y) \sim N(\bar{w}, A^{-1}),$$

where $\bar{w} = \sigma_n^{-2} A^{-1} X y$ and $A = \sigma_n^{-2} X X^\top + \Sigma_p^{-1}$.


> In a non-Bayesian setting the negative log prior is sometimes thought of as a penalty term, and the maximum a posteriori (MAP) point estimate is known as the penalized maximum likelihood estimate of the weights.

> The penalized maximum likelihood procedure is known in this case as ridge regression [Hoerl and Kennard, 1970], because of the effect of the quadratic penalty term $\tfrac{1}{2} w^\top \Sigma_p^{-1} w$ from the log prior.
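The link between the MAP estimate and ridge regression can be made explicit with a short derivation. This is a sketch under one extra assumption that the slides do not state: an isotropic prior $\Sigma_p = \tau^2 I$. The symbols $\tau^2$ and $\lambda$ are notation introduced here for illustration only.

```latex
\begin{aligned}
-\log p(w \mid X, y)
  &= \frac{1}{2\sigma_n^2} \sum_{i=1}^{n} (y_i - x_i^\top w)^2
     + \frac{1}{2} w^\top \Sigma_p^{-1} w + \text{const} \\
  &= \frac{1}{2\sigma_n^2} \left[ \sum_{i=1}^{n} (y_i - x_i^\top w)^2
     + \lambda\, w^\top w \right] + \text{const},
     \qquad \lambda = \frac{\sigma_n^2}{\tau^2}, \\
\hat{w}_{\text{MAP}}
  &= (X X^\top + \lambda I)^{-1} X y
   = \sigma_n^{-2} \left( \sigma_n^{-2} X X^\top + \tau^{-2} I \right)^{-1} X y .
\end{aligned}
```

Under this assumption the bracketed term is the usual ridge criterion, and the last expression has the same form as the posterior mean of the weights, so the MAP estimate and the posterior mean coincide for this Gaussian model.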


> To make predictions for a test case we average over all possible parameter values, weighted by their posterior probability.

> Thus the predictive distribution for $f_* = f(x_*)$ at $x_*$ is given by averaging the output of all possible linear models with respect to the Gaussian posterior:

$$p(f_* \mid x_*, X, y) = \int p(f_* \mid x_*, w)\, p(w \mid X, y)\, dw = N\!\left(\sigma_n^{-2}\, x_*^\top A^{-1} X y,\; x_*^\top A^{-1} x_*\right).$$
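The posterior and predictive formulas of the weight-space view can be checked numerically in the simplest case. The following is a minimal sketch, assuming $D = 1$ so that $X X^\top$, $\Sigma_p$ and $A$ are all scalars; the function names and the toy data are hypothetical and not part of the module.

```python
def posterior_scalar_weight(xs, ys, noise_var, prior_var):
    """Posterior mean and variance of w for y = x * w + eps when D = 1.

    A     = X X^T / noise_var + 1 / prior_var   (a scalar when D = 1)
    w_bar = A^{-1} X y / noise_var
    """
    xx = sum(x * x for x in xs)                # X X^T
    xy = sum(x * y for x, y in zip(xs, ys))    # X y
    A = xx / noise_var + 1.0 / prior_var
    w_bar = xy / (noise_var * A)
    return w_bar, 1.0 / A


def predictive_scalar(x_star, w_bar, post_var):
    """Mean and variance of f_* = x_* w under the Gaussian posterior."""
    return x_star * w_bar, x_star * post_var * x_star


# Hypothetical toy data lying exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
w_bar, post_var = posterior_scalar_weight(xs, ys, noise_var=1.0, prior_var=1.0)
f_mean, f_var = predictive_scalar(2.0, w_bar, post_var)
```

With these numbers $A = 15$, so the posterior mean is $28/15$, slightly below the least-squares slope of 2: the zero-mean prior pulls the estimate towards zero.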


Projections of Inputs into Feature Space

> The Bayesian linear model suffers from limited expressiveness.

> A simple idea to overcome this problem is to first project the inputs into some high-dimensional space using a set of basis functions, and then apply the linear model in this space instead of directly on the inputs themselves.

> For example, a scalar input $x$ could be projected into the space of powers of $x$, $\phi(x) = (1, x, x^2, x^3, \ldots)^\top$, to implement polynomial regression.


> As long as the projections are fixed functions (i.e. independent of the parameters $w$), the model is still linear in the parameters and therefore analytically tractable.

> Specifically, we introduce the function $\phi(x)$, which maps a $D$-dimensional input vector $x$ into an $N$-dimensional feature space.

> Let $\Phi(X)$ be the aggregation of the columns $\phi(x)$ for all cases in the training set. Now the model is

$$f(x) = \phi(x)^\top w,$$

where the vector of parameters now has length $N$.


> The analysis for this model is analogous to that of the standard linear model, except that $\Phi(X)$ is substituted for $X$ everywhere.

> Thus the predictive distribution becomes

$$f_* \mid x_*, X, y \sim N\!\left(\sigma_n^{-2}\, \phi(x_*)^\top A^{-1} \Phi y,\; \phi(x_*)^\top A^{-1} \phi(x_*)\right),$$

where $\Phi = \Phi(X)$ and $A = \sigma_n^{-2} \Phi \Phi^\top + \Sigma_p^{-1}$.

> To make predictions using this equation we need to invert the matrix $A$ of size $N \times N$, which may not be convenient if $N$, the dimension of the feature space, is large.


> We can rewrite the predictive distribution as

$$f_* \mid x_*, X, y \sim N\!\left(\phi_*^\top \Sigma_p \Phi (K + \sigma_n^2 I)^{-1} y,\; \phi_*^\top \Sigma_p \phi_* - \phi_*^\top \Sigma_p \Phi (K + \sigma_n^2 I)^{-1} \Phi^\top \Sigma_p \phi_*\right), \qquad (2)$$

where $\phi(x_*) = \phi_*$ and we have defined $K = \Phi^\top \Sigma_p \Phi$.

> In equation (2) the feature space always enters in the form $\Phi^\top \Sigma_p \Phi$, $\phi_*^\top \Sigma_p \Phi$, or $\phi_*^\top \Sigma_p \phi_*$.

> The entries of these matrices are invariably of the form $\phi(x)^\top \Sigma_p \phi(x')$.


> Define $k(x, x') = \phi(x)^\top \Sigma_p \phi(x')$, known as a covariance function or kernel function.

> Notice that $\phi(x)^\top \Sigma_p \phi(x')$ is an inner product with respect to $\Sigma_p$.

> As $\Sigma_p$ is positive definite, we can define $\psi(x) = \Sigma_p^{1/2} \phi(x)$ and obtain the simple dot-product representation $k(x, x') = \psi(x) \cdot \psi(x')$.

> If an algorithm is defined solely in terms of inner products in input space, then it can be lifted into feature space by replacing occurrences of those inner products by $k(x, x')$; this is sometimes called the kernel trick.
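The dot-product representation of the kernel can be verified on a small case. This is only a sketch: it assumes a polynomial feature map of degree 2 and a diagonal $\Sigma_p$, and the names `phi`, `psi`, `kernel` and the values in `prior_vars` are hypothetical choices made for illustration.

```python
import math


def phi(x, degree=2):
    """Polynomial feature map phi(x) = (1, x, ..., x^degree)."""
    return [x ** p for p in range(degree + 1)]


def kernel(x, x_prime, prior_vars):
    """k(x, x') = phi(x)^T Sigma_p phi(x') for a diagonal Sigma_p."""
    return sum(s * a * b for s, a, b in zip(prior_vars, phi(x), phi(x_prime)))


def psi(x, prior_vars):
    """psi(x) = Sigma_p^{1/2} phi(x); elementwise for a diagonal Sigma_p."""
    return [math.sqrt(s) * a for s, a in zip(prior_vars, phi(x))]


prior_vars = [1.0, 0.5, 0.25]   # assumed diagonal entries of Sigma_p
k_direct = kernel(2.0, 3.0, prior_vars)
k_dot = sum(a * b for a, b in zip(psi(2.0, prior_vars), psi(3.0, prior_vars)))
```

Both routes give $1 + 3 + 9 = 13$ here, which is the point of the kernel trick: any computation that needs only $k(x, x')$ never has to form the feature vectors explicitly.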


Function-Space View

> Definition: A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

> A Gaussian process is completely specified by its mean function and covariance function. We define the mean function $m(x)$ and the covariance function $k(x, x')$ of a real process $f(x)$ as

$$m(x) = E[f(x)], \qquad k(x, x') = E[(f(x) - m(x))(f(x') - m(x'))], \qquad (3)$$

and will write the Gaussian process as

$$f(x) \sim \mathcal{GP}(m(x), k(x, x')). \qquad (4)$$


> The random variables represent the value of the function $f(x)$ at location $x$.

> A Gaussian process is defined as a collection of random variables.

> The definition automatically implies a consistency requirement, which is also sometimes known as the marginalization property.


> The marginalization property simply means that if the GP specifies, for example, $(y_1, y_2) \sim N(\mu, \Sigma)$, then it must also specify $y_1 \sim N(\mu_1, \Sigma_{11})$, where $\Sigma_{11}$ is the relevant submatrix of $\Sigma$.

> A simple example of a Gaussian process can be obtained from our Bayesian linear regression model $f(x) = \phi(x)^\top w$ with prior $w \sim N(0, \Sigma_p)$. We have for the mean and covariance

$$E[f(x)] = \phi(x)^\top E[w] = 0,$$

$$E[f(x) f(x')] = \phi(x)^\top E[w w^\top] \phi(x') = \phi(x)^\top \Sigma_p \phi(x').$$


> In this module we consider the squared exponential (SE) covariance function. A list of covariance functions is presented later.

> The covariance function specifies the covariance between pairs of random variables:

$$\mathrm{cov}(f(x_p), f(x_q)) = k(x_p, x_q) = \exp\!\left(-\tfrac{1}{2} |x_p - x_q|^2\right).$$

> Note that the covariance between the outputs is written as a function of the inputs.
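To see how the squared exponential covariance behaves, it helps to evaluate it on a few inputs. The sketch below assumes scalar inputs; the function names and the three input locations are hypothetical.

```python
import math


def se_kernel(xp, xq):
    """Squared exponential covariance exp(-0.5 * |xp - xq|^2) for scalar inputs."""
    return math.exp(-0.5 * (xp - xq) ** 2)


def covariance_matrix(inputs):
    """Matrix K with entries K[p][q] = k(x_p, x_q)."""
    return [[se_kernel(xp, xq) for xq in inputs] for xp in inputs]


# Two nearby inputs (0.0 and 0.1) and one distant input (3.0).
K = covariance_matrix([0.0, 0.1, 3.0])
```

The entry for the nearby pair is close to one, while the entries involving the distant input are close to zero, even though only the inputs (not the outputs) were used to compute them.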


> For the squared exponential covariance function, we see that the covariance is almost unity between variables whose corresponding inputs are very close, and decreases as their distance in the input space increases.

> It can be shown that the squared exponential covariance function corresponds to a Bayesian linear regression model with an infinite number of basis functions.

> For every positive definite covariance function $k(\cdot, \cdot)$ there exists a (possibly infinite) expansion in terms of basis functions; this result is known as Mercer's theorem.

> To draw sample functions from the GP prior, we choose a set of input points, compute the corresponding covariance matrix from the covariance function, and generate a random Gaussian vector with this covariance matrix.


[Figure: four functions drawn at random from a GP prior; vertical axis: output, f(x).]

> The graph above shows four functions drawn at random from a GP prior.

> Notice that, informally, the functions look smooth.

> The squared exponential covariance function is infinitely differentiable.

> The characteristic length-scale is around one unit.


> In more realistic modelling situations we do not have access to the function values themselves, but only to noisy versions

$$y = f(x) + \varepsilon.$$

> Assuming additive independent identically distributed Gaussian noise $\varepsilon$ with variance $\sigma_n^2$, the prior on the noisy observations becomes

$$\mathrm{cov}(y_p, y_q) = k(x_p, x_q) + \sigma_n^2 \delta_{pq} \quad \text{or} \quad \mathrm{cov}(y) = K(X, X) + \sigma_n^2 I,$$

where $\delta_{pq}$ is the Kronecker delta, which is one iff $p = q$ and zero otherwise.


> The joint distribution of the observed target values and the function values at the test locations under the prior is

$$\begin{bmatrix} y \\ f_* \end{bmatrix} \sim N\!\left(0,\; \begin{bmatrix} K(X, X) + \sigma_n^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix}\right).$$

> The predictive equations for Gaussian process regression are

$$f_* \mid X, y, X_* \sim N\!\left(\bar{f}_*, \mathrm{cov}(f_*)\right),$$

where

$$\bar{f}_* = E[f_* \mid X, y, X_*] = K(X_*, X)\,[K(X, X) + \sigma_n^2 I]^{-1} y,$$

$$\mathrm{cov}(f_*) = K(X_*, X_*) - K(X_*, X)\,[K(X, X) + \sigma_n^2 I]^{-1} K(X, X_*).$$
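The predictive equations can be evaluated directly for a small data set. The following is a minimal sketch, assuming scalar inputs, the unit-scale squared exponential covariance and a single test point; the helper names, the hand-written Cholesky solver and the toy data are assumptions made for illustration, not the module's own implementation.

```python
import math


def se_kernel(a, b):
    """Unit-scale squared exponential covariance for scalar inputs."""
    return math.exp(-0.5 * (a - b) ** 2)


def cholesky(A):
    """Lower-triangular L with L L^T = A (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def solve_spd(L, b):
    """Solve (L L^T) z = b by forward and then backward substitution."""
    n = len(b)
    u = [0.0] * n
    for i in range(n):
        u[i] = (b[i] - sum(L[i][k] * u[k] for k in range(i))) / L[i][i]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (u[i] - sum(L[k][i] * z[k] for k in range(i + 1, n))) / L[i][i]
    return z


def gp_predict(X, y, x_star, noise_var):
    """Predictive mean and variance of f_* at one test input x_star."""
    n = len(X)
    Ky = [[se_kernel(X[p], X[q]) + (noise_var if p == q else 0.0)
           for q in range(n)] for p in range(n)]
    L = cholesky(Ky)
    k_star = [se_kernel(x_star, xi) for xi in X]   # K(x_*, X)
    alpha = solve_spd(L, y)                        # [K + sigma_n^2 I]^{-1} y
    v = solve_spd(L, k_star)                       # [K + sigma_n^2 I]^{-1} K(X, x_*)
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = se_kernel(x_star, x_star) - sum(k * w for k, w in zip(k_star, v))
    return mean, var


X = [-1.0, 0.0, 1.0]              # hypothetical training inputs
y = [0.5, 1.0, 0.5]               # hypothetical targets
mean_at_0, var_at_0 = gp_predict(X, y, 0.0, noise_var=1e-6)
mean_far, var_far = gp_predict(X, y, 10.0, noise_var=1e-6)
```

With almost no noise the prediction at a training input reproduces the observed target with near-zero variance, while far from the data the prediction falls back to the prior mean 0 and prior variance 1.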


Varying the Hyperparameters

> The covariance functions that we use will have some free parameters.

> For example, the squared-exponential covariance function in one dimension has the following form:

$$k_y(x_p, x_q) = \sigma_f^2 \exp\!\left(-\frac{1}{2\ell^2}(x_p - x_q)^2\right) + \sigma_n^2 \delta_{pq}.$$

> The covariance is denoted $k_y$, as it is for the noisy targets $y$ rather than for the underlying function $f$.

> Observe that the length-scale $\ell$, the signal variance $\sigma_f^2$ and the noise variance $\sigma_n^2$ can be varied.

> In general we call the free parameters hyperparameters.


Estimation of Hyperparameters

> Estimation of these hyperparameters is an important step.

> The marginal likelihood is the integral of the likelihood times the prior:

$$p(y \mid X) = \int p(y \mid f, X)\, p(f \mid X)\, df.$$

> The term marginal likelihood refers to the marginalization over the function values $f$.

> Under the Gaussian process model the prior is Gaussian, $f \mid X \sim N(0, K_\theta)$, and the likelihood is a factorized Gaussian, $y \mid f \sim N(f, \sigma_n^2 I)$, so that

$$\log p(y \mid X, \theta, \sigma_n^2) = -\tfrac{1}{2}\, y^\top (K_\theta + \sigma_n^2 I)^{-1} y - \tfrac{1}{2} \log |K_\theta + \sigma_n^2 I| - \tfrac{n}{2} \log 2\pi.$$

> This result can also be obtained directly by observing that $y \sim N(0, K_\theta + \sigma_n^2 I)$.
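The log marginal likelihood can be computed from a Cholesky factor of the noisy covariance matrix. The sketch below assumes scalar inputs and the one-dimensional squared-exponential form of $k_y$; the function names, the toy data and the small grid of candidate length-scales are hypothetical, and the grid search is only a crude stand-in for a proper optimiser.

```python
import math


def k_y(xp, xq, same_point, length_scale, signal_var, noise_var):
    """sigma_f^2 exp(-(xp - xq)^2 / (2 l^2)) + sigma_n^2 delta_pq."""
    k = signal_var * math.exp(-0.5 * ((xp - xq) / length_scale) ** 2)
    return k + (noise_var if same_point else 0.0)


def cholesky(A):
    """Lower-triangular L with L L^T = A (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def log_marginal_likelihood(X, y, length_scale, signal_var, noise_var):
    """log p(y | X, theta, sigma_n^2) with theta = (length_scale, signal_var)."""
    n = len(X)
    Ky = [[k_y(X[p], X[q], p == q, length_scale, signal_var, noise_var)
           for q in range(n)] for p in range(n)]
    L = cholesky(Ky)
    # Solve L u = y, so that y^T Ky^{-1} y = u^T u.
    u = [0.0] * n
    for i in range(n):
        u[i] = (y[i] - sum(L[i][k] * u[k] for k in range(i))) / L[i][i]
    quad = sum(v * v for v in u)
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))  # log |Ky|
    return -0.5 * quad - 0.5 * log_det - 0.5 * n * math.log(2.0 * math.pi)


# Single observation: Ky = 2, so the value can be checked by hand.
single = log_marginal_likelihood([0.0], [1.0], 1.0, 1.0, 1.0)

# Hypothetical data and a coarse grid over the length-scale only.
X = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [0.1, 0.8, 1.0, 0.7, 0.2]
candidates = [0.3, 1.0, 3.0]
best_l = max(candidates,
             key=lambda l: log_marginal_likelihood(X, y, l, 1.0, 0.01))
```

Maximising this quantity over the hyperparameters corresponds to the posterior mode under a flat hyper-prior; a non-flat hyper-prior would add its log density to the objective.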


log-Posterior for Hyperparameters

> We can assume a suitable hyper-prior on $(\theta, \sigma_n^2)$ of the form $p(\theta, \sigma_n^2) = p(\theta)\, p(\sigma_n^2)$.

> The log-posterior distribution is, up to an additive constant,

$$\log p(\theta, \sigma_n^2 \mid y, X) = \log p(y \mid X, \theta, \sigma_n^2) + \log p(\theta) + \log p(\sigma_n^2) + \text{const}.$$

> The posterior mode is

$$(\hat{\theta}, \hat{\sigma}_n^2) = \underset{\theta \in \Theta,\; \sigma_n^2 \in \mathbb{R}^+}{\arg\max}\; \log p(\theta, \sigma_n^2 \mid y, X).$$


[Figure: Estimated Curve; vertical axis: output, f(x).]

[Figure: Estimated Curve with CI; vertical axis: output, f(x).]


Maximum Likelihood Estimation - The Basic Idea 
Module 1 


Saurav De 


Department of Statistics 
Presidency University 




• Let us begin with the following example.

• Ex. 1. A certain lion has three possible states of activity, namely very active, moderately active and lethargic. So the state of activity is the parameter, say $\theta$, which has three known levels: $\theta_1$ = very active, $\theta_2$ = moderately active and $\theta_3$ = lethargic.

• So we can say that the (known) parameter space is $\Omega = \{\theta_1, \theta_2, \theta_3\}$.

• Suppose that each night this lion eats $X$ animals with probability $P_\theta[X = x] = f(x, \theta)$, say. Then consider the following table.

x              |  0      1      2      3      4
P_θ1[X = x]    |  0      0.05   0.05   0.8    0.1
P_θ2[X = x]    |  0.05   0.05   0.8    0.1    0
P_θ3[X = x]    |  0.9    0.09   0.01   0      0


• From the above table we can say that if $x$ is kept fixed at the value 4, then the most likely value of $\theta$ is $\theta_1$ (last column of the table).

• If $x$ is given as 1, then $\theta_3$ is the most likely value of $\theta$ (second column of the table), etc.

• The estimate of $\theta$ so obtained is known as the Maximum Likelihood (ML) Estimate of $\theta$.

• The method applied to get such an estimate is known as the Maximum Likelihood Method of Estimation.

Let us now formalise this celebrated method of estimation through the following discussion.
• Suppose $\{p(x, \theta);\ \theta \in \Omega\}$ is a family of joint probability functions (p.m.f. or p.d.f.) of $(X_1, \ldots, X_n)$, characterised by the parameter $\theta = (\theta_1, \ldots, \theta_s)$, $s \ge 1$, where $\Omega$ is the parameter space.

• Given $x$, the corresponding family member becomes a function of $\theta$, called the likelihood function of $\theta$ given $x$, and is usually denoted by $L_x(\theta)$ or simply $L(\theta)$, $\theta \in \Omega$.

• So in the above table the rows give the p.m.f. values of the distribution of $X$ under the different fixed choices of $\theta$, whereas the first column gives the values of $L(\theta)$ for $\theta = \theta_1, \theta_2, \theta_3$ when $x = 0$, column 2 provides the same given $x = 1$, etc.

Now any value of $\theta$ (within the parameter space) for which the likelihood function attains its supremum is known as a Maximum Likelihood Estimate of $\theta$. So in the above example $\theta_1$ is the ML estimate of $\theta$ given $x = 4$, $\theta_2$ is the ML estimate of $\theta$ given $x = 2$, etc.
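Reading the lion table column by column can be written as a few lines of code. This is a small sketch; the dictionary layout and the function name are hypothetical, and the probabilities are those of the table with the entries read as 0.8 and 0.1 where the decimal points were lost.

```python
# Rows of the table: P_theta[X = x] for x = 0, 1, 2, 3, 4.
pmf = {
    "theta1": [0.0, 0.05, 0.05, 0.8, 0.1],
    "theta2": [0.05, 0.05, 0.8, 0.1, 0.0],
    "theta3": [0.9, 0.09, 0.01, 0.0, 0.0],
}


def ml_estimate(x):
    """Return the theta whose likelihood L(theta) = P_theta[X = x] is largest."""
    return max(pmf, key=lambda theta: pmf[theta][x])


# ML estimate of theta for every possible observed value x.
estimates = {x: ml_estimate(x) for x in range(5)}
```

Each row is a probability distribution over $x$, while each column, read as a function of $\theta$, is a likelihood function; the columns need not sum to one.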


Thus the principle of maximum likelihood estimation consists of selecting as an estimator of $\theta$ (real or vector-valued) a $\hat{\theta}(X)$, where $X = (X_1, X_2, \ldots, X_n)$, that maximises the likelihood function $L(\theta)$; in other words, of finding a mapping $\hat{\theta}: \mathbb{R}^n \to \Omega$ that satisfies

$$L(\hat{\theta}) = \sup_{\theta \in \Omega} L(\theta).$$

Such a $\hat{\theta}$, if it exists, is called a Maximum Likelihood Estimator.


Some Introductory Examples

Example A. Let $X_1, X_2, \ldots, X_n$ be iid $R(0, \theta)$ variables, $\theta \in (0, \infty)$.

Given $x = (x_1, x_2, \ldots, x_n)$, the likelihood function is

$$L(\theta) = \begin{cases} \dfrac{1}{\theta^n}, & \theta \ge x_{(n)} \\ 0, & \text{otherwise,} \end{cases}$$

where $x_{(n)} = \max\{x_1, \ldots, x_n\}$.

Since $L(\theta)$ is decreasing in $\theta$ on $[x_{(n)}, \infty)$, here $\hat{\theta}_{MLE} = x_{(n)} \in (0, \infty)$.
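The shape of the likelihood in Example A can be checked by evaluating it on a grid. This is a minimal sketch; the sample values, the grid and the function name are hypothetical.

```python
def uniform_likelihood(theta, sample):
    """L(theta) for an iid R(0, theta) sample: theta^(-n) if theta >= max, else 0."""
    if theta < max(sample):
        return 0.0
    return theta ** (-len(sample))


sample = [0.8, 2.5, 1.7]                   # hypothetical observations
theta_hat = max(sample)                    # x_(n), the largest observation
grid = [0.5 * k for k in range(1, 21)]     # theta = 0.5, 1.0, ..., 10.0
best_on_grid = max(grid, key=lambda t: uniform_likelihood(t, sample))
```

The likelihood is zero to the left of the largest observation and strictly decreasing to the right of it, so the maximum sits exactly at $x_{(n)}$; no derivative is involved.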


A note on Ex. A.

Let $n = 2$, i.e. $X_1, X_2 \sim R(0, \theta)$ independently.

• The joint pdf of $X_1, X_2$ is a bivariate function

$$f(x_1, x_2) = \begin{cases} \dfrac{1}{\theta^2}, & 0 < x_1, x_2 < \theta \\ 0, & \text{otherwise.} \end{cases}$$

[Figure: Surface plot of f vs x2, x1.]

But the likelihood function for given $x_1, x_2$ is a univariate function

$$L(\theta) = \begin{cases} \dfrac{1}{\theta^2}, & x_{(2)} \le \theta \\ 0, & \text{otherwise.} \end{cases}$$

The corresponding plot follows.

[Figure: Scatterplot of L(theta) vs theta.]


A similar type of illustration under the normal family of distributions, using R.

R Code :

library(TeachingDemos)

m <- runif(1, 50, 100)   # initial value of mean
s <- runif(1, 1, 10)     # initial value of SD
x <- rnorm(15, m, s)     # initial sample drawn from normal distribution

mm <- mean(x)            # sample mean (ML estimate of the mean)
ss <- sqrt(var(x))       # sample SD (divisor n - 1)
# ML estimate of the SD (divisor n); the factor is (n - 1)/n for the sample size used
ss2 <- sqrt(var(x) * (length(x) - 1) / length(x))

mle.demo(x)              # log likelihood calculated here


Figure: Log likelihood when $\mu = 94.6$, $\sigma = 4.8$ (log likelihood = -49.807).

Figure: Log likelihood when $\mu = 88.8$, $\sigma = 4.8$ (log likelihood = -79.549).

Figure: Log likelihood when $\mu = 88.8$, $\sigma = 7.3$ (log likelihood = -61.604).
Ex. B. Let $X_1, X_2, \ldots, X_n$ be iid with common pdf

$$f_\theta(x) = \begin{cases} \theta x^{\theta - 1}, & 0 < x < 1 \\ 0, & \text{otherwise,} \end{cases} \qquad \theta > 0.$$

Then, given $x$, the likelihood function is

$$L(\theta) = \theta^n \left(\prod_{i=1}^{n} x_i\right)^{\theta - 1}$$

$$\Rightarrow\ \ell(\theta) = \log L(\theta) = n \log \theta + (\theta - 1) \sum_{i=1}^{n} \log x_i$$

$$\Rightarrow\ \frac{\partial \ell(\theta)}{\partial \theta} = \frac{n}{\theta} + \sum_{i=1}^{n} \log x_i.$$

By the maxima-minima principle,

$$\frac{\partial \ell(\theta)}{\partial \theta} = 0 \ \Rightarrow\ \hat{\theta} = -\frac{n}{\sum_{i=1}^{n} \log x_i} \in (0, \infty).$$

Also the second-order condition $\dfrac{\partial^2 \ell(\theta)}{\partial \theta^2} = -\dfrac{n}{\theta^2} < 0\ \ \forall\ \theta > 0$ ensures that $\hat{\theta}$ maximises the likelihood function and hence is the MLE of $\theta$.
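The closed-form estimate of Ex. B can be compared with the log likelihood evaluated at neighbouring values of $\theta$. This is a sketch with hypothetical data; the function names are illustrative only.

```python
import math


def log_likelihood(theta, sample):
    """l(theta) = n log(theta) + (theta - 1) * sum(log x_i)."""
    n = len(sample)
    return n * math.log(theta) + (theta - 1.0) * sum(math.log(x) for x in sample)


def mle(sample):
    """Closed-form maximiser theta_hat = -n / sum(log x_i)."""
    return -len(sample) / sum(math.log(x) for x in sample)


sample = [0.9, 0.5, 0.7, 0.95]     # hypothetical observations in (0, 1)
theta_hat = mle(sample)
ll_at_hat = log_likelihood(theta_hat, sample)
```

Because every observation lies in $(0, 1)$, each $\log x_i$ is negative, which is why the estimate is always positive.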


Ex. C. Let $X_1, X_2, \ldots, X_n$ be iid Bernoulli($\theta$) random variables, $\theta \in [0, 1]$. Then, given $x$, the likelihood function is

$$L(\theta) = \theta^{\sum_{i=1}^{n} x_i} (1 - \theta)^{n - \sum_{i=1}^{n} x_i}.$$

Let $\sum_{i=1}^{n} x_i = a$.

Suppose
Case I: $0 < a < n$, i.e. $0 < \frac{a}{n} < 1$

Case II: $a = 0$, i.e. $\frac{a}{n} = 0$

Case III: $a = n$, i.e. $\frac{a}{n} = 1$

Case I: $0 < a < n$

For the $n$ numbers $\frac{\theta}{a}, \ldots, \frac{\theta}{a}$ ($a$ times) and $\frac{1 - \theta}{n - a}, \ldots, \frac{1 - \theta}{n - a}$ ($n - a$ times), the AM-GM inequality gives

$$\frac{a \cdot \frac{\theta}{a} + (n - a) \cdot \frac{1 - \theta}{n - a}}{n} \ \ge\ \left[\left(\frac{\theta}{a}\right)^{a} \left(\frac{1 - \theta}{n - a}\right)^{n - a}\right]^{1/n}$$

$$\Leftrightarrow\ \theta^{a} (1 - \theta)^{n - a} \ \le\ \left(\frac{a}{n}\right)^{a} \left(1 - \frac{a}{n}\right)^{n - a},$$

with equality if and only if $\theta = \frac{a}{n}$.

Case II: $a = 0$

$a = 0 \Leftrightarrow x_i = 0,\ i = 1, 2, \ldots, n$, which means $L(\theta) = (1 - \theta)^n$, which attains its supremum at $\theta = 0 = \frac{a}{n}$.

Case III: $a = n$

$a = n \Leftrightarrow x_i = 1,\ i = 1, 2, \ldots, n$, which means $L(\theta) = \theta^n$, which attains its supremum at $\theta = 1 = \frac{a}{n}$.

So under all the cases, $L(\theta)$ attains its supremum at $\theta = \frac{a}{n} \in [0, 1]$.
$\Rightarrow$ the MLE of $\theta$ is $\hat{\theta} = \frac{a}{n} = \bar{X}$, the sample mean.


Application 1. Let $X$ follow a hypergeometric distribution and, given $X = x$, let the likelihood function be

$$L(N) = \frac{\binom{M}{x} \binom{N - M}{n - x}}{\binom{N}{n}},$$

a function of the parameter $N$; $M$ and $n$ being known.

Let $R(N) = \dfrac{L(N)}{L(N - 1)} = \dfrac{(N - n)(N - M)}{N (N - M - n + x)}$.

Now, for values of $N$ for which the ratio $R(N) > 1$, $L(N)$ increases with $N$, and $L(N)$ is decreasing in $N$ for those $N$ which result in $R(N) < 1$.

Let $\hat{N} = \hat{N}(X)$ be the MLE of $N$. Then

$$R(N) > (<)\ 1 \ \text{ if and only if } \ N < (>)\ \frac{nM}{x}.$$

This implies that $L(N)$ attains its maximum at $N \approx \frac{nM}{x}$. Thus

$$\hat{N}(X) = \left[\frac{nM}{X}\right],$$

where $[x]$ denotes the largest integer contained in $x$.
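The ratio argument of Application 1 can be confirmed by brute force on a small case. This is a sketch under stated assumptions: the counts are hypothetical, $x \ge 1$ is assumed so that $nM/x$ is defined, and the search range for $N$ is an arbitrary finite window.

```python
import math


def likelihood(N, M, n, x):
    """Hypergeometric probability of x marked items in a sample of n from N."""
    return math.comb(M, x) * math.comb(N - M, n - x) / math.comb(N, n)


def mle_population_size(M, n, x):
    """N_hat = floor(n * M / x); assumes x >= 1."""
    return (n * M) // x


M, n, x = 50, 40, 7                # hypothetical known counts and observation
N_hat = mle_population_size(M, n, x)

# Brute-force check over admissible N (N must be at least M + n - x).
N_brute = max(range(M + n - x, 2000), key=lambda N: likelihood(N, M, n, x))
```

When $nM/x$ happens to be an integer, $R(N) = 1$ at that value, so $N = nM/x$ and $N = nM/x - 1$ give the same likelihood and the maximiser is not unique.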


Maximum Likelihood Estimation - a 
The Basic Idea 


Application 2. Let X1, X2,..., X, Ï common pdf fo(x) where 


2 
f(x) = = 0cx«0 
Q 
DER XO peg 
a(o — 8) 
= 0, otherwise, 


where a > 0 is a (known) constant and 0 € (0, a). Here the likelihood 


function of @ is T 
Xi Q — Xi 
5) = (2) II gli w= 


x;«ü x;>0 


Assume that 0 < x1) < X2) < ... < X(y) € q, i.e. the ordered 
observations . 




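case I: $x_{(j)} < \theta \le x_{(j+1)}$

Then $L(\theta) = \left(\dfrac{2}{\alpha}\right)^{n} \theta^{-j} (\alpha - \theta)^{-(n-j)} \prod_{i=1}^{j} x_{(i)} \prod_{i=j+1}^{n} (\alpha - x_{(i)})$

Check that the first-order and second-order conditions ensure that the stationary value of $\theta$ that exists must be a point of minimum,
$\forall\ j = 1, 2, \ldots, n-1$.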




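case II: $0 < \theta \le x_{(1)}$

Then $L(\theta) = \left(\dfrac{2}{\alpha}\right)^{n} (\alpha - \theta)^{-n} \prod_{i=1}^{n} (\alpha - x_{(i)})$, which increases with $\theta$.

Similarly under case III: $x_{(n)} < \theta < \alpha$, $L(\theta) \downarrow \theta$.

So the ML estimate of $\theta$ is one of the observations $x_{(1)}, \ldots, x_{(n)}$, depending on the observations and the value of $\alpha$.

Note. This is an example where the MLE is necessarily an actual observation, but not necessarily any particular observation.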




Application 3. Let $X_1, X_2, \ldots, X_n$ be a random sample from some DF F on the real line. Suppose that we observe $x_1, x_2, \ldots, x_n$, which are all different. Show that the MLE of F is $F_n$, the empirical DF of the sample.

Proof. The empirical DF at the point x is defined as $F_n(x) = \frac{\#\{X_i \le x\}}{n}$. Let $F(x) = \theta$, the parameter of the distribution.

Define $Y_i = 1\ (0)$ for $X_i \le x$ $(X_i > x)$, $i = 1, 2, \ldots, n$. Then

$P_\theta[Y_i = 1] = P[X_i \le x] = F(x) = \theta$. So $Y_1, Y_2, \ldots, Y_n$ follow Bernoulli($\theta$), $0 \le \theta \le 1$, independently.

Then the MLE of $\theta$ is $\bar Y = \sum_{i=1}^{n} Y_i / n = \frac{\#\{X_i \le x\}}{n} = F_n(x)$. Proved




TRY YOURSELF! 


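M1.1. A random sample of size n is drawn from a population with density

$f_\theta(x) = \dfrac{\theta}{x^{\theta+1}}$, $x > 1$
$\phantom{f_\theta(x)} = 0$, otherwise.

Find the MLE of $\theta$ if it exists.

M1.2. Find the MLE of $\theta$ based on a random sample of size n for the following Rectangular distributions

a. $R(-\theta, \theta)$, $\theta > 0$
b. $R(\theta, \theta^2)$, $\theta > 1$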



TUTORIAL DISCUSSION : 


Overview to the problems from MODULE 1... 


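M1.1. The positive joint density of $x = (x_1, x_2, \ldots, x_n)$ is

$h_\theta(x) = \dfrac{\theta^{n}}{\prod_{i=1}^{n} x_i^{\theta+1}}$, $x_i > 1\ \forall\ i$.

So the likelihood function of $\theta$ for given x is

$L(\theta) = \dfrac{\theta^{n}}{\left(\prod_{i=1}^{n} x_i\right)^{\theta+1}}$, $0 < \theta < \infty$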




Here $L(\theta)$ may not be a monotone function of $\theta$.
Using the maxima-minima principle we can find that $L(\theta)$ is maximised at the value $\theta = \dfrac{n}{\sum_{i=1}^{n} \log x_i} \in (0, \infty)$.

Hence the MLE of $\theta$ is $\hat\theta = \dfrac{n}{\sum_{i=1}^{n} \log X_i}$.




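M1.2. a. The joint pdf of $x = (x_1, x_2, \ldots, x_n)$ from $R(-\theta, \theta)$, $\theta > 0$, is

$f_\theta(x) = \dfrac{1}{(2\theta)^{n}}$ if $-\theta \le x_1, x_2, \ldots, x_n \le \theta$
$\phantom{f_\theta(x)} = 0$ otherwise

Or

$f_\theta(x) = \dfrac{1}{(2\theta)^{n}}$ if $-\theta \le x_{(1)} \le x_{(n)} \le \theta$
$\phantom{f_\theta(x)} = 0$ otherwise

Now $-\theta \le x_{(1)}$, $x_{(n)} \le \theta \iff \theta \ge \max\{-x_{(1)}, x_{(n)}\}$.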




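So the likelihood function of $\theta$ is

$L(\theta) = \dfrac{1}{(2\theta)^{n}}$ if $\max\{-x_{(1)}, x_{(n)}\} \le \theta < \infty$
$\phantom{L(\theta)} = 0$ otherwise

Hence $L(\theta) \downarrow \theta$ for $0 < \max\{-x_{(1)}, x_{(n)}\} \le \theta < \infty$.

$\Rightarrow$ the MLE of $\theta$ is $\hat\theta = \max\{-X_{(1)}, X_{(n)}\}$.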




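M1.2. b. The likelihood function of $\theta$ based on $x = (x_1, x_2, \ldots, x_n)$ from $R(\theta, \theta^2)$, $\theta > 1$, is

$L(\theta) = \dfrac{1}{(\theta^2 - \theta)^{n}}$ if $\theta \le x_{(1)} \le x_{(n)} \le \theta^2$
$\phantom{L(\theta)} = 0$ otherwise

That is

$L(\theta) = \dfrac{1}{\theta^{n}(\theta - 1)^{n}}$ if $1 < \sqrt{x_{(n)}} \le \theta \le x_{(1)}$
$\phantom{L(\theta)} = 0$ otherwise

Now it is easy to check that under $1 < \sqrt{x_{(n)}} \le \theta \le x_{(1)}$, $L(\theta) > 0$ and $\downarrow \theta$. So the supremum of $L(\theta)$ is attained at $\hat\theta = \sqrt{x_{(n)}}$. $\Rightarrow$ the MLE of $\theta$ is $\sqrt{X_{(n)}}$.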




MLE on Truncated Parameter Space 
Module 2 


Saurav De 


Department of Statistics 
Presidency University 




Let us begin with an illustration. 


Suppose 10 randomly chosen sample observations from a Poisson($\lambda$) distribution are as follows:

2 3 4 1 2 1 2 1 3 0

If the sample observations are denoted by $x = (x_1, x_2, \ldots, x_{10})$, then given x the likelihood function of $\lambda$ is

$L(\lambda) = \dfrac{\exp[-10\lambda]\,\lambda^{10\bar x}}{\prod_{i=1}^{10} x_i!} = \dfrac{\exp[-10\lambda]\,\lambda^{19}}{6912}$

where $\bar x$ = sample mean = 1.9 and $\prod_{i=1}^{10} x_i! = 6912$ from the given data.




A simple plot of $\log L(\lambda)$ against nonnegative values of $\lambda$, based on the following table

λ | log L(λ)
0.5 | -27.0108
0.8 | -21.0807
1.1 | -18.0301
1.5 | -16.1372
1.9 | -15.6458
2.2 | -15.8603
2.5 | -16.4315
2.8 | -17.2782
3.0 | -17.9674
3.3 | -19.1565

is as follows:




[Figure: scatterplot of log L(λ) (vertical axis, from -27.5 to -15.0) against λ (horizontal axis, from 0.5 to 3.5).]




We know that for the Poisson probability model with nonnegative $\lambda$ the ML estimate is $\bar x$ (try yourself), i.e. here 1.9. The plot also shows that the log-likelihood attains its maximum at 1.9 (very close to 2).
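The tabulated values can be reproduced with a few lines of R. This is a sketch only: it assumes the tenth observation is 0, the only value consistent with the stated sample mean 1.9 and the stated product of factorials 6912, and the names `loglik`, `lambda_grid` and `lambda_hat` are introduced here for illustration.

```r
# Ten Poisson observations (last value taken as 0; see the assumption above)
x <- c(2, 3, 4, 1, 2, 1, 2, 1, 3, 0)

# Log-likelihood of lambda: sum of log Poisson probabilities
loglik <- function(lambda) sum(dpois(x, lambda, log = TRUE))

# Grid of lambda values used in the table
lambda_grid <- c(0.5, 0.8, 1.1, 1.5, 1.9, 2.2, 2.5, 2.8, 3.0, 3.3)
tab <- data.frame(lambda = lambda_grid,
                  logL = round(sapply(lambda_grid, loglik), 4))
print(tab)

# Numerical maximiser over (0.01, 10), to be compared with the sample mean
lambda_hat <- optimize(loglik, interval = c(0.01, 10), maximum = TRUE)$maximum
cat("numerical maximiser:", lambda_hat, " sample mean:", mean(x), "\n")
```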


Now suppose that, instead of $[0, \infty)$, the parameter space is $[\lambda_0, \infty)$, $\lambda_0$ known. The following cases are easy to see, at least from the plot.

case I: If $\lambda_0 \le \bar x < \infty$, $L(\lambda)$ is still maximum at $\bar x$ $\Rightarrow$ MLE $\hat\lambda = \bar x$




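case II: If $\lambda_0 > \bar x$, $L(\lambda) \downarrow \lambda$ $\Rightarrow$ $\hat\lambda = \lambda_0$

$\hat\lambda = \lambda_0$ if $\bar x < \lambda_0$
$\phantom{\hat\lambda} = \bar x$ if $\bar x \in [\lambda_0, \infty)$

Similarly, if the parameter space is $[\lambda_0, \lambda_1]$, $\lambda_0$, $\lambda_1$ known, the following cases are easy to see, at least from the plot.

case I: If $\lambda_0 > \bar x$, $L(\lambda) \downarrow \lambda$ $\Rightarrow$ $\hat\lambda = \lambda_0$

case II: If $\lambda_1 < \bar x$, $L(\lambda) \uparrow \lambda$ $\Rightarrow$ $\hat\lambda = \lambda_1$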




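$\hat\lambda = \lambda_0$ if $\bar x < \lambda_0$
$\phantom{\hat\lambda} = \bar x$ if $\bar x \in [\lambda_0, \lambda_1]$
$\phantom{\hat\lambda} = \lambda_1$ if $\bar x > \lambda_1$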




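Next we consider another illustration with the Exponential (mean = $\theta$) distribution having density

$f_\theta(x) = \dfrac{1}{\theta} \exp(-x/\theta)$, $x > 0$
$\phantom{f_\theta(x)} = 0$ otherwise

where $\theta \in (0, \infty)$.
Based on a random sample $X_1, X_2, \ldots, X_n$ the likelihood is

$L(\theta) = \dfrac{1}{\theta^{n}} \exp\left(-\sum_{i=1}^{n} x_i / \theta\right)$, $\theta \in (0, \infty)$

Using the maxima-minima principle, $L(\theta)$ attains its maximum at the sample mean $\bar x$.

$\Rightarrow$ MLE $\hat\theta = \bar X$.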




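Now suppose that, instead of $(0, \infty)$, the parameter space is $[\theta_0, \infty)$, $\theta_0$ known. The following cases are easy to see.

case I: If $\theta_0 \le \bar x < \infty$, $L(\theta)$ is still maximum at $\bar x$ $\Rightarrow$ MLE $\hat\theta = \bar x$
case II: If $\theta_0 > \bar x$, $L(\theta) \downarrow \theta$ $\Rightarrow$ $\hat\theta = \theta_0$

$\hat\theta = \theta_0$ if $\bar x < \theta_0$
$\phantom{\hat\theta} = \bar x$ if $\bar x \in [\theta_0, \infty)$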




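Similarly, if the parameter space is $[\theta_0, \theta_1]$, $\theta_0$, $\theta_1$ known, the following cases are easy to see.

case I: If $\theta_0 > \bar x$, $L(\theta) \downarrow \theta$ $\Rightarrow$ $\hat\theta = \theta_0$

case II: If $\theta_1 < \bar x$, $L(\theta) \uparrow \theta$ $\Rightarrow$ $\hat\theta = \theta_1$

case III: If $\theta_0 \le \bar x \le \theta_1$, $L(\theta)$ is still maximum at $\bar x$ $\Rightarrow$ MLE $\hat\theta = \bar x$

$\hat\theta = \theta_0$ if $\bar x < \theta_0$
$\phantom{\hat\theta} = \bar x$ if $\bar x \in [\theta_0, \theta_1]$
$\phantom{\hat\theta} = \theta_1$ if $\bar x > \theta_1$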




An illustration of the computation of MLE on truncated parameter space 
has been done on exponential distribution using R. 


R Code and Output : 


> theta0=2 # lower bound of parametric space
> theta1=3 # upper bound of parametric space
> theta=runif(1,theta0,theta1)
> theta # value of the parameter that is considered
[1] 2.896314
> samp=rexp(1000,(1/theta)) # random sample drawn
> x_bar=mean(samp)
> x_bar # mean of the random sample drawn
[1] 2.848733




R Code and Output (continued) :

> if(x_bar > theta1)
+ {
+   cat("\n")
+   cat("MLE of 'theta' = ",theta1,"\n")
+ }
> if(x_bar < theta0)
+ {
+   cat("\n")
+   cat("MLE of 'theta' = ",theta0,"\n")
+ }
> if(x_bar >= theta0 & x_bar <= theta1)
+ {
+   cat("\n")
+   cat("MLE of 'theta' = ",x_bar,"\n")
+ }

MLE of 'theta' = 2.848733




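The discussion so far on the MLE under a truncated parameter space has been quite informal. Still, it is enough to state the following: if T = T(X) is the MLE of a parameter $\theta$ under the full or unrestricted parameter space, and the likelihood rises up to T and falls beyond it (as in the illustrations above), then under a truncated parameter space $[\theta^{*}, \theta^{**}]$

$\hat\theta(X) = \theta^{*}$ if $T(X) < \theta^{*}$
$\phantom{\hat\theta(X)} = T(X)$ if $\theta^{*} \le T(X) \le \theta^{**}$
$\phantom{\hat\theta(X)} = \theta^{**}$ if $T(X) > \theta^{**}$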
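The three-branch rule above amounts to projecting the unrestricted MLE onto the interval. A minimal sketch in R follows; the function name `truncated_mle` is hypothetical, and the rule is only valid under the unimodality assumed in the illustrations.

```r
# Project an unrestricted MLE onto the truncated parameter space [lower, upper].
# Valid when the likelihood increases up to the unrestricted MLE and decreases after it.
truncated_mle <- function(t_unrestricted, lower, upper) {
  min(max(t_unrestricted, lower), upper)
}

# The three cases with parameter space [2, 3]
inside <- truncated_mle(2.848733, 2, 3)  # unrestricted MLE lies inside: returned unchanged
below  <- truncated_mle(1.4, 2, 3)       # unrestricted MLE below the space: lower bound
above  <- truncated_mle(3.7, 2, 3)       # unrestricted MLE above the space: upper bound
print(c(inside = inside, below = below, above = above))
```

This replaces the three separate `if` blocks of the console session by a single expression.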




Formal discussion on MLE based on truncated parameter space 


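Illustration. Let $X_1, \ldots, X_n$ be iid $N(\mu, 1)$ where $\mu \in [0, \infty)$. Then the log of the likelihood of $\mu$ is

$\log L(\mu) = \text{constant} - \dfrac{1}{2}\sum_{i=1}^{n} (x_i - \mu)^2$

$\dfrac{\partial}{\partial\mu} \log L(\mu) = n(\bar x - \mu)$ and $\dfrac{\partial^2}{\partial\mu^2} \log L(\mu) = -n$ for any $\mu \ge 0$.
At $\mu = 0$ we consider differentiation with respect to $\mu$ from the right-hand side.

If $\bar x \in [0, \infty)$, from F.O.C. and S.O.C. we have MLE $\hat\mu = \bar x$.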




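$\bar x < 0 \Rightarrow \dfrac{\partial}{\partial\mu} \log L(\mu) < 0\ \forall\ \mu \ge 0 \Rightarrow \log L(\mu) \downarrow \mu\ \forall\ \mu \ge 0$.
$\Rightarrow$ $L(\mu)$ is maximised at $\mu = 0$. Therefore

$\hat\mu = \bar X$ if $\bar X \ge 0$
$\phantom{\hat\mu} = 0$ if $\bar X < 0$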




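• Asymptotic distribution of $\hat\mu$ when $\mu > 0$ :

$\sqrt{n}(\hat\mu - \mu) = \sqrt{n}(\bar X - \mu) + \sqrt{n}(\hat\mu - \bar X)$ (•)

Now $\sqrt{n}(\bar X - \mu) \sim N(0, 1)$ for any $\mu \in [0, \infty)$.

And as $\hat\mu - \bar X = 0 \Rightarrow |\sqrt{n}(\hat\mu - \bar X)| < \epsilon$, so

$P[|\sqrt{n}(\hat\mu - \bar X)| < \epsilon] \ge P[\hat\mu - \bar X = 0] = P[\bar X \ge 0] = \Phi(\sqrt{n}\mu) \to 1$ as $n \to \infty$

$\Rightarrow \sqrt{n}(\hat\mu - \bar X) \xrightarrow{P} 0$

Finally (•) $\Rightarrow \sqrt{n}(\hat\mu - \mu) \xrightarrow{L} N(0, 1)$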




• Asymptotic distribution of $\hat\mu$ when $\mu = 0$ :

Here $P[\bar X \ge 0] = \frac{1}{2}\ \forall\ n$.
Thus the distribution function $F_n$ of $\sqrt{n}(\hat\mu - 0)$ is

$F_n(y) = 0$ if $y < 0$
$\phantom{F_n(y)} = \Phi(y)$ if $y \ge 0$

Note. This is not normal: it puts mass $\frac{1}{2}$ at the point 0. Thus the asymptotic distribution of the MLE is normal for all $\mu \in [0, \infty)$ except at the boundary point $\mu = 0$ of the truncated parameter space.
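The boundary behaviour can be seen in a small simulation. The sketch below, in R, assumes n = 50 and 10000 replications (arbitrary illustrative choices) and draws the sample mean directly from its N(0, 1/n) distribution at μ = 0.

```r
set.seed(123)
n <- 50        # sample size (illustrative)
reps <- 10000  # number of simulated samples (illustrative)

# At mu = 0 the sample mean is N(0, 1/n); simulate it directly
xbar <- rnorm(reps, mean = 0, sd = 1 / sqrt(n))

# MLE on the truncated space [0, Inf): negative means are replaced by 0
mu_hat <- pmax(xbar, 0)
z <- sqrt(n) * mu_hat

prop_at_zero <- mean(z == 0)   # should be close to 1/2
prop_below_1 <- mean(z <= 1)   # should be close to pnorm(1)
cat("mass at 0:", prop_at_zero, "\n")
cat("P[z <= 1]:", prop_below_1, " Phi(1):", pnorm(1), "\n")
```

About half of the simulated values sit exactly at 0, which a normal limit could not produce.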




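Comparison through MSE criterion: MSE($\hat\mu$) $\le$ MSE($\bar X$) for all $\mu \in [0, \infty)$

Justification. With $f_\mu(\bar x)$ the density of $\bar X$,

MSE($\hat\mu$) $= \int_{0}^{\infty} (\bar x - \mu)^2 f_\mu(\bar x)\,d\bar x + \int_{-\infty}^{0} \mu^2 f_\mu(\bar x)\,d\bar x$
MSE($\bar X$) $= \int_{0}^{\infty} (\bar x - \mu)^2 f_\mu(\bar x)\,d\bar x + \int_{-\infty}^{0} (\bar x - \mu)^2 f_\mu(\bar x)\,d\bar x$

$\Rightarrow$ MSE($\bar X$) $-$ MSE($\hat\mu$) $= \int_{-\infty}^{0} \bar x(\bar x - 2\mu) f_\mu(\bar x)\,d\bar x$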




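Now $\bar x < 0$ under the range of integration $\Rightarrow$ also $\bar x - 2\mu < 0$ as $\mu \ge 0$.

$\Rightarrow \bar x(\bar x - 2\mu) \ge 0 \Rightarrow \int_{-\infty}^{0} \bar x(\bar x - 2\mu) f_\mu(\bar x)\,d\bar x \ge 0$.

Hence justified.

Note. If $\mu \in (0, \infty)$ in the above illustration, the asymptotic distributions of $\hat\mu$ and $\bar X$ are the same. For this reason one can use $\bar X$ as the MLE of $\mu$ in this situation.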
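The size of the MSE gap can be evaluated by numerical integration. The following R sketch assumes n = 10 and a few illustrative values of μ; the function name `mse_gap` is introduced here and is not part of the lecture.

```r
# MSE(X-bar) - MSE(mu-hat) = integral over (-Inf, 0] of x (x - 2 mu) f_mu(x) dx,
# where f_mu is the N(mu, 1/n) density of the sample mean
mse_gap <- function(mu, n) {
  integrand <- function(x) x * (x - 2 * mu) * dnorm(x, mean = mu, sd = 1 / sqrt(n))
  integrate(integrand, lower = -Inf, upper = 0)$value
}

mus <- c(0, 0.1, 0.5, 1)               # illustrative parameter values
gaps <- sapply(mus, mse_gap, n = 10)   # illustrative sample size
print(data.frame(mu = mus, mse_gap = gaps))
```

The gap is largest at the boundary μ = 0, where it equals half of Var(X̄) = 1/(2n), and it shrinks quickly as μ moves away from 0.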




TRY YOURSELF! 


M2.1. Based on a random sample of size n from a normal $(0, \theta)$ distribution with $c \le \theta \le 2c$; c (known), find the MLE of $\theta$.


M2.2. A random sample of size n be drawn from a population with density 


0 


= 0 ow. 


Find the MLE of $\theta \in [K, \infty)$, if it exists, where K is a given positive number.




TUTORIAL DISCUSSION : 


Overview to the problems from MODULE 2... 


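M2.1. The likelihood function of $\theta$ for given $x_1, x_2, \ldots, x_n$ is

$L(\theta) = \dfrac{1}{(2\pi\theta)^{n/2}} \exp\left(-\dfrac{1}{2\theta}\sum_{i=1}^{n} x_i^2\right)$, $c \le \theta \le 2c$

$\Rightarrow$ the loglikelihood of $\theta$ is

$l(\theta) = \text{constant} - \dfrac{n}{2}\log\theta - \dfrac{1}{2\theta}\sum_{i=1}^{n} x_i^2$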


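So

$l'(\theta) = -\dfrac{n}{2\theta} + \dfrac{1}{2\theta^2}\sum_{i=1}^{n} x_i^2 = \dfrac{n}{2\theta^2}\left(s_0^2 - \theta\right)$

where $s_0^2 = \dfrac{1}{n}\sum_{i=1}^{n} x_i^2$.

Case I. $c \le s_0^2 \le 2c$

Here both $s_0^2$ and $\theta \in [c, 2c]$. Then

$l''(\theta)\big|_{\theta = s_0^2} = -\dfrac{n}{2 s_0^4} < 0$

$\Rightarrow$ the MLE of $\theta$ under case I is $s_0^2$.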




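Case II. $s_0^2 < c$

Here $s_0^2 - \theta < 0$
$\Rightarrow l'(\theta) = \dfrac{n}{2\theta^2}\left(s_0^2 - \theta\right) < 0$
$\Rightarrow L(\theta) \downarrow \theta$

$\Rightarrow$ the MLE of $\theta$ under this case is c.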




Case III. $s_0^2 > 2c$
Here $s_0^2 - \theta > 0$
$\Rightarrow l'(\theta) > 0$
$\Rightarrow L(\theta) \uparrow \theta$

$\Rightarrow$ the MLE of $\theta$ under case III is 2c.




Hence combining all these cases 


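MLE $\hat\theta = c$ if $s_0^2 < c$
$\phantom{MLE \hat\theta} = s_0^2$ if $c \le s_0^2 \le 2c$
$\phantom{MLE \hat\theta} = 2c$ if $s_0^2 > 2c$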




Learning Objectives 


• MLE may be biased

• MLE may not be unique

• MLE may not exist

• MLE may be worthless

• MLE may not be sufficient, consistent, asymptotically normal




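Ex. Let $X_1, X_2, \ldots, X_n$ be iid $R(0, \theta)$ variables; $\theta \in (0, \infty)$. Then, given x, the likelihood function is

$L_x(\theta) = \dfrac{1}{\theta^{n}}$, $0 < x_i \le \theta\ \forall\ i = 1, 2, \ldots, n$, i.e. $\theta \ge x_{(n)}$
$\phantom{L_x(\theta)} = 0$, otherwise

where $x_{(n)} = \max\{x_1, x_2, \ldots, x_n\}$.

So $\hat\theta_{MLE} = X_{(n)}$, the maximum likelihood estimator of $\theta$. (Already discussed)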




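The density of $X_{(n)}$ is

$g_\theta(t) = \dfrac{n t^{n-1}}{\theta^{n}}$, $0 < t < \theta$
$\phantom{g_\theta(t)} = 0$, otherwise

Now check that $E(X_{(n)}) = \dfrac{n\theta}{n+1}$.

So $X_{(n)}$ is biased and only asymptotically unbiased for $\theta$,

i.e. the bias $E(X_{(n)}) - \theta = -\dfrac{\theta}{n+1} \ne 0$ but $\to 0$ as $n \to \infty$.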




Ex. Let $X_1, X_2, \ldots, X_n$ be a random sample from

$N(\theta_1, \theta_2)$; $-\infty < \theta_1 < \infty$, $\theta_2 > 0$

Then the MLE of $\theta_2$ is $S^2 = \dfrac{1}{n}\sum_{i=1}^{n} (X_i - \bar X)^2$, the sample variance with divisor n (To be discussed later).

Here $S^2$ is not an unbiased estimator of $\theta_2$ because

$E(S^2) = \dfrac{n-1}{n}\,\theta_2$

Note. $S^2$ is asymptotically unbiased for $\theta_2$, as $E(S^2) \to \theta_2$ when $n \to \infty$.




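• Ex. Let $X_1, X_2, \ldots, X_n$ be iid $R(\theta, \theta + 1)$ variables; $\theta \in (-\infty, \infty)$.

Then
$L_x(\theta) = 1$, $\theta \le x_i \le \theta + 1$; $i = 1, 2, \ldots, n$
$\phantom{L_x(\theta)} = 0$, otherwise
That is
$L_x(\theta) = 1$, $x_{(n)} - 1 \le \theta \le x_{(1)}$
$\phantom{L_x(\theta)} = 0$, otherwise
$\Rightarrow$ any value in $[x_{(n)} - 1, x_{(1)}]$ is a maximum likelihood estimate of $\theta$.

Note. If $x_{(n)} - 1 = x_{(1)}$ then the unique ML estimate of $\theta$ is $x_{(1)} = x_{(n)} - 1$.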




Ex. Let X ~ pmf f; $f \in \{f_1, f_2, f_3, f_4\}$ where

$f_1(x)$ = Bin (5,0.8)
$f_2(x)$ = Poisson (1)
$f_3(x) = (0.8)(0.2)^{x}$
$f_4(x) = -\dfrac{(0.4)^{x+1}}{(x+1)\ln(0.6)}$

Given X = 0 find the most likely choice of f.

Here f is the unknown parameter $\theta$. Given X = 0 the likelihood function is
$L(\theta \mid X = 0) = P[X = 0 \mid \theta]$. Hence




L(0|X — 0) 08, 0—h 


— 078, 0—f, 
So L(0) is maximum for any of f and fs. 
=> X=0MLE of feth Bp. 


Clearly MLE is not unique in this case. 



• Ex. Let $X_1, X_2, \ldots, X_n$ be iid Bernoulli (p) variables; $p \in (0, 1)$.
Here the likelihood function of p is

$L_x(p) = p^{\sum x_i}(1 - p)^{n - \sum x_i}$

• If $\sum x_i = 0 \iff x_i = 0,\ i = 1, 2, \ldots, n$, then $L_x(p) = (1 - p)^{n}$, whose supremum is at p = 0; but p > 0.

• If $\sum x_i = n \iff x_i = 1,\ i = 1, 2, \ldots, n$, then $L_x(p) = p^{n}$, whose supremum is at p = 1; but p < 1.




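Thus in each of the two extreme cases (i.e. $\bar x$ = 0 or 1) the supremum is attained only at the boundary of $\Omega = (0, 1)$, which lies outside $\Omega$.

$0 < \sum x_i < n \Rightarrow L_x(p)$ is supremum at $p = \dfrac{\sum x_i}{n} \in (0, 1)$.

$\Rightarrow \bar X = \dfrac{\sum X_i}{n}$ is the MLE of p, provided $0 < \sum x_i < n$, i.e. $0 < \bar x < 1$.

Otherwise the MLE of p does not exist.

Note 1: If $0 \le p \le 1$: $\hat p_{MLE} = \bar X$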



Note 2: $p \in (0, 1) \Rightarrow$ some trouble in defining the MLE of p if the sample gives either $\bar x$ = 0 or 1. But

$P[\bar X = 0 \text{ or } 1] = P\left[\sum_{i=1}^{n} X_i = 0 \text{ or } n\right] = (1 - p)^{n} + p^{n} \to 0$

as $n \to \infty$ for $p \in (0, 1)$.

$\Rightarrow$ the probability of such an extreme case occurring is negligible if n is large enough.

$\Rightarrow$ for large n, $\bar X$ can be used as the MLE of $p \in (0, 1)$ in practice.
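How fast the probability of the two extreme samples vanishes can be tabulated directly. A minimal R sketch, assuming p = 0.3 and a handful of illustrative sample sizes:

```r
p <- 0.3                      # illustrative value of p in (0, 1)
n <- c(5, 10, 25, 50, 100)    # illustrative sample sizes

# P[all observations are 0] + P[all observations are 1]
prob_extreme <- (1 - p)^n + p^n
print(data.frame(n = n, prob_extreme = prob_extreme))
```

The decay is geometric in n, so even moderate sample sizes make the non-existence of the MLE a practically negligible event unless p is very close to 0 or 1.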



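An alternative way to remove the undefinedness of the MLE if $p \in (0, 1)$

Take the MLE as
$\hat p = \epsilon^{*}$ if $\bar x = 0$
$\phantom{\hat p} = \bar x$ if $0 < \bar x < 1$
$\phantom{\hat p} = 1 - \epsilon^{**}$ if $\bar x = 1$

where $\epsilon^{*}$ and $\epsilon^{**}$ are infinitesimally small positive numbers.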




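Obviously
$\sqrt{n}(\hat p - p) = \sqrt{n}(\bar X - p) + \sqrt{n}(\hat p - \bar X)$
From CLT $\sqrt{n}(\bar X - p) \xrightarrow{L} N(0, p(1 - p))$

Also as $\hat p - \bar X = 0 \Rightarrow \sqrt{n}\,|\hat p - \bar X| < \epsilon$ for every $\epsilon > 0$, so
$P[\sqrt{n}\,|\hat p - \bar X| < \epsilon] \ge P[\hat p = \bar X] = P[0 < \bar X < 1] = 1 - (1 - p)^{n} - p^{n} \to 1$
as $n \to \infty$ for every $\epsilon > 0$ for $p \in (0, 1)$.

$\Rightarrow \sqrt{n}(\hat p - \bar X) \xrightarrow{P} 0$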




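So, by Slutsky's theorem (i.e. $X_n \xrightarrow{P} c$, $Y_n \xrightarrow{L} Z \Rightarrow X_n + Y_n \xrightarrow{L} Z + c$) we get

$\sqrt{n}(\hat p - p) \xrightarrow{L} N(0, p(1 - p))$

So the proposed MLE $\hat p$ and $\bar X$ are equivalent at least asymptotically.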




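• Ex. Let X ~ Bernoulli(p); $\frac{1}{4} \le p \le \frac{3}{4}$. Then

$L_x(p) = p^{x}(1 - p)^{1 - x}$, i.e. $1 - p$ for x = 0 and p for x = 1

$\Rightarrow$ given x = 0, $\hat p_{MLE} = \frac{1}{4}$, and for x = 1, $\hat p_{MLE} = \frac{3}{4}$

Only two co-ordinate points $(0, \frac{1}{4})$ and $(1, \frac{3}{4})$ lie on $(X, \hat p_{MLE}(X))$ $\Rightarrow \hat p_{MLE} = \dfrac{2X + 1}{4}$.

Other functional forms passing through these two points are possible, and they are all equivalent.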




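MSE($\hat p_{MLE}$) $= E_p(\hat p_{MLE} - p)^2 = E_p\left(\dfrac{2X + 1}{4} - p\right)^2 = \dfrac{1}{16}$ (on simplification)

Now, just take a trivial estimator of p, say $p^{*} = \frac{1}{2}$.
MSE($p^{*}$) $= \left(\frac{1}{2} - p\right)^2 \le \frac{1}{16}\ \forall\ p \in \left[\frac{1}{4}, \frac{3}{4}\right]$, with $<$ for $p \ne \frac{1}{4}, \frac{3}{4}$.

$\Rightarrow$ the performance of $\hat p_{MLE}$ is uniformly worse even than $p^{*}$, a trivial estimator of p.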
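The two MSE curves can be computed exactly, since the MLE takes only two values. A short R sketch over an illustrative grid of p in [1/4, 3/4]:

```r
p <- seq(0.25, 0.75, by = 0.05)   # illustrative grid over the parameter space

# The MLE equals 1/4 when X = 0 (probability 1 - p) and 3/4 when X = 1 (probability p)
mse_mle <- (1 - p) * (1/4 - p)^2 + p * (3/4 - p)^2

# The trivial estimator p* = 1/2 has no variance, only squared bias
mse_trivial <- (1/2 - p)^2

print(data.frame(p = p, mse_mle = mse_mle, mse_trivial = mse_trivial))
```

The first column of MSE values is constant at 1/16, while the second reaches 1/16 only at the two end points of the parameter space.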




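• Ex. Let $X_{ij} \sim N(\mu_i, \sigma^2)$, $i = 1, \ldots, n$; $j = 1, 2$, independently.

$L_x(\mu_1, \ldots, \mu_n, \sigma^2) = (2\pi\sigma^2)^{-n} \exp\left[-\sum_{i=1}^{n}\sum_{j=1}^{2} (x_{ij} - \mu_i)^2 / 2\sigma^2\right]$

• The MLEs of $\mu_i$ and $\sigma^2$ are respectively $\hat\mu_i = \bar x_i$ and

$\hat\sigma^2 = \dfrac{1}{2n}\sum_{i=1}^{n}\sum_{j=1}^{2} (x_{ij} - \bar x_i)^2$ (To be discussed later).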




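• On simplification, finally we have

$\hat\sigma^2 = \dfrac{\sigma^2}{2} \cdot \dfrac{1}{n}\sum_{i=1}^{n} \dfrac{(X_{i1} - X_{i2})^2}{2\sigma^2} \xrightarrow{P} \dfrac{\sigma^2}{2}$

(as by Khinchin's WLLN $\dfrac{1}{n}\sum_{i=1}^{n} \dfrac{(X_{i1} - X_{i2})^2}{2\sigma^2} \xrightarrow{P} 1$)

• Hence $\hat\sigma^2 \xrightarrow{P} \dfrac{\sigma^2}{2}$ $\Rightarrow \hat\sigma^2$ is not consistent for $\sigma^2$. [Neyman and Scott]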
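The inconsistency shows up clearly in simulation. The R sketch below assumes two observations per group, $\sigma^2 = 4$, n = 20000 groups, and arbitrary group means (all illustrative choices, not from the lecture).

```r
set.seed(1)
n <- 20000      # number of groups (illustrative)
sigma2 <- 4     # true common variance (illustrative)

# Each group has its own mean; the values of the means do not matter for the result
mu <- rnorm(n, mean = 0, sd = 10)
x1 <- rnorm(n, mean = mu, sd = sqrt(sigma2))
x2 <- rnorm(n, mean = mu, sd = sqrt(sigma2))

# MLE of sigma^2: pooled within-group sum of squares divided by the 2n observations
xbar <- (x1 + x2) / 2
sigma2_hat <- sum((x1 - xbar)^2 + (x2 - xbar)^2) / (2 * n)

cat("MLE of sigma^2:", sigma2_hat, " true value:", sigma2,
    " sigma^2 / 2:", sigma2 / 2, "\n")
```

Even with many groups the estimate settles near $\sigma^2/2$, because the number of nuisance means grows with the sample size.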




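• Ex. Let $X_1, X_2, \ldots, X_n$ be iid $R(0, \theta)$ variables; $\theta \in (0, \infty)$. Then $\hat\theta_{MLE} = X_{(n)}$, the maximum order statistic. (Try yourself)

For $t > 0$,

$P[n(\theta - X_{(n)}) \le t] = 1 - P[X_{(n)} < \theta - t/n]$

$= 1 - \dfrac{n}{\theta^{n}}\int_{0}^{\theta - t/n} x^{n-1}\,dx$

$= 1 - \left(1 - \dfrac{t}{n\theta}\right)^{n} \to 1 - \exp(-t/\theta)$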




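• $\Rightarrow n(\theta - X_{(n)}) \xrightarrow{L} Z \sim$ exponential distribution with density $\frac{1}{\theta}\exp(-x/\theta)$; as $n \to \infty$.

• Note: Here $\sqrt{n}(\theta - X_{(n)}) = \dfrac{n(\theta - X_{(n)})}{\sqrt{n}} \xrightarrow{P} 0$.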
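The convergence of the exact distribution function to the exponential limit can be inspected numerically. A minimal R sketch, assuming θ = 2 and t = 1.5 purely for illustration:

```r
theta <- 2    # illustrative parameter value
t <- 1.5      # illustrative point at which the distribution functions are compared

# Exact P[n (theta - X_(n)) <= t] for a sample of size n (valid when t <= n * theta)
exact_cdf <- function(n) 1 - (1 - t / (n * theta))^n

# Limiting exponential distribution function with mean theta
limit_cdf <- 1 - exp(-t / theta)

sizes <- c(5, 50, 500, 5000)
print(data.frame(n = sizes, exact = sapply(sizes, exact_cdf), limit = limit_cdf))
```

The normalising rate here is n rather than the usual $\sqrt{n}$, which is why the limit is exponential rather than normal.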




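Ex. Let $X_1, \ldots, X_n$ follow independently the common positive density

$f_\theta(x) = \exp\{-(x - \theta)\}$, $x \ge \theta$; $\theta \in (-\infty, \infty)$.

Find the asymptotic distribution of $X_{(1)} - \theta$, where $X_{(1)} = \min\{X_1, \ldots, X_n\}$.

The distribution function of $Y_n = X_{(1)} - \theta$ is

$F_n(y) = 0$ if $y < 0$
$\phantom{F_n(y)} = 1 - \exp(-ny)$ if $y \ge 0$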




Define

$F^{*}(y) = 0$ if $y < 0$
$\phantom{F^{*}(y)} = 1$ if $y \ge 0$

Check that
(i) $F^{*}(y)$ is a distribution function,

(ii) $F^{*}(y)$ is the distribution function of a degenerate probability distribution with degeneracy at y = 0, and

(iii) $F_n(y) \to F^{*}(y)$ for all y except at y = 0, the only point of discontinuity.




Invariance Property and Likelihood Equation of MLE 
Module 4 


Saurav De 


Department of Statistics 
Presidency University 




MLE and Invariance Property 


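Let $\hat\theta$ be the MLE of $\theta$. Then for the parametric function $g(\theta) : \Omega \to \Gamma$, the MLE of $g(\theta)$ is $g(\hat\theta)$.

Proof. Let us define $\Omega_\gamma = \{\theta : g(\theta) = \gamma\}$. This means $\Omega = \bigcup_{\gamma \in \Gamma} \Omega_\gamma$.

Again let $M_x(\gamma) = \sup_{\theta \in \Omega_\gamma} L_x(\theta)$ = likelihood function induced by g.

We are to find $\hat\gamma$ at which $M_x(\gamma)$ is maximised.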




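Write $\hat\gamma_0 = g(\hat\theta)$. Now

$M_x(\hat\gamma_0) = \sup_{\theta \in \Omega_{\hat\gamma_0}} L_x(\theta) \ge L_x(\hat\theta)$

where $\Omega_{\hat\gamma_0} = \{\theta : g(\theta) = \hat\gamma_0\}$. As $g(\hat\theta) = \hat\gamma_0$, so $\hat\theta \in \Omega_{\hat\gamma_0}$.

Again

$M_x(\hat\gamma_0) \le \sup_{\gamma \in \Gamma} M_x(\gamma) = \sup_{\gamma \in \Gamma} \sup_{\theta \in \Omega_\gamma} L_x(\theta) = \sup_{\theta \in \Omega} L_x(\theta) = L_x(\hat\theta)$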




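Therefore

$M_x(\hat\gamma_0) = L_x(\hat\theta) = \sup_{\gamma \in \Gamma} M_x(\gamma)$.

Hence $\hat\gamma_0$ is the MLE of $\gamma$, i.e. $g(\hat\theta)\ (= \hat\gamma_0)$ is the MLE of $g(\theta)\ (= \gamma)$.
Proved

Ex.3 Let $X_1, X_2, \ldots, X_n \sim$ Bin (1, p), $0 \le p \le 1$.
Then $V_p(X) = p(1 - p)\ (= g(p))$ and $\hat p_{MLE} = \bar X_n$.

By the invariance property the MLE of $V_p(X)$ is

$g(\hat p_{MLE}) = \hat p_{MLE}(1 - \hat p_{MLE}) = \bar X_n(1 - \bar X_n)$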




Application 4. Suppose that n observations are taken on a random variable X with distribution $N(\mu, 1)$, but instead of recording all the observations, one notes only whether or not the observation is less than 0. If $\{X < 0\}$ occurs $m\ (< n)$ times, find the MLE of $\mu$.

Let $X_1, X_2, \ldots, X_n \sim N(\mu, 1)$.
Let $\theta = P_\mu[X_1 < 0] = \Phi(-\mu) = 1 - \Phi(\mu)$.

This means $\mu = -\Phi^{-1}(\theta)$, a continuous function of $\theta$.




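Define $Y_i = 1$ if $X_i < 0$ and $Y_i = 0$ otherwise. Then $Y_1, Y_2, \ldots, Y_n \sim$ Bin $(1, \theta)$; $0 \le \theta \le 1$.
Now the MLE of $\theta$ is $\bar Y = \frac{1}{n}\sum_{i=1}^{n} Y_i = \frac{\#\{X_i < 0\}}{n} = \frac{m}{n}$. (See Application 3.)

Hence by the invariance property the MLE of $\mu$ is $-\Phi^{-1}\left(\frac{m}{n}\right)$.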
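In R the inverse of Φ is `qnorm`, so the estimate is a one-liner. The sketch below assumes illustrative counts n = 100 and m = 30, which are not taken from the lecture.

```r
n <- 100   # number of observations (illustrative)
m <- 30    # number of observations found to be below 0 (illustrative)

theta_hat <- m / n               # MLE of theta = P[X < 0]
mu_hat <- -qnorm(theta_hat)      # MLE of mu by the invariance property

cat("MLE of theta:", theta_hat, " MLE of mu:", mu_hat, "\n")

# Sanity check: plugging mu_hat back in should reproduce theta_hat
theta_back <- pnorm(-mu_hat)
```

Fewer than half of the observations fall below 0 in this example, so the estimated mean is positive, as expected.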




Application. A commuter's trip consists of first riding a subway to the bus stop and then taking a bus. The bus she would like to catch arrives uniformly over the interval $\theta_1$ to $\theta_2$. She would like to estimate both $\theta_1$ and $\theta_2$ so that she has some idea of the time by which she should be at the bus stop ($\theta_1$) and the time after which she would be too late and must wait for the next bus ($\theta_2$). Over an 8-day period she makes certain to be at the bus stop early enough not to miss the bus and records the following arrival times of the bus.

5:15 PM , 5:21 PM , 5:14 PM , 5:23 PM , 5:29 PM , 5:17 PM , 5:15 PM , 5:18 PM

Estimate $\theta_1$ and $\theta_2$. Also give the MLEs for the mean and the variability of the arrival distribution.


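Solution: Let T : arrival time of the bus. Then according to the question $T \sim R(\theta_1, \theta_2)$.

Let $T_1, T_2, \ldots, T_n$ be n independent random arrival times of the bus.

Then the likelihood of $\theta = (\theta_1, \theta_2)$ is

$L(\theta) = \dfrac{1}{(\theta_2 - \theta_1)^{n}}$ if $\theta_1 \le t_i \le \theta_2$, $i = 1, 2, \ldots, n$
$\phantom{L(\theta)} = 0$ otherwise

i.e. $L(\theta) = \dfrac{1}{(\theta_2 - \theta_1)^{n}}$ if $\theta_1 \le t_{(1)} \le t_{(n)} \le \theta_2$
$\phantom{i.e. L(\theta)} = 0$ otherwise

where $t_{(1)} = \min\{t_1, t_2, \ldots, t_n\}$ and $t_{(n)} = \max\{t_1, t_2, \ldots, t_n\}$.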




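Under $\theta_1 \le t_{(1)} \le t_{(n)} \le \theta_2$, $t_{(n)} - t_{(1)} \le \theta_2 - \theta_1$.

Hence $L(\theta_1, \theta_2) = \dfrac{1}{(\theta_2 - \theta_1)^{n}} \le \dfrac{1}{(t_{(n)} - t_{(1)})^{n}} = L(t_{(1)}, t_{(n)})$.

Thus the MLE of $(\theta_1, \theta_2)$ is $(T_{(1)}, T_{(n)})$, where $T_{(1)}$ and $T_{(n)}$ are the minimum and maximum order statistics respectively.

Now $E(T) = \dfrac{\theta_1 + \theta_2}{2}$ and $V(T) = \dfrac{(\theta_2 - \theta_1)^2}{12}$ are two continuous functions of $(\theta_1, \theta_2)$.

So the MLE of the mean is $\dfrac{T_{(1)} + T_{(n)}}{2}$ and that of the variance is $\dfrac{(T_{(n)} - T_{(1)})^2}{12}$, i.e. the MLE of the standard deviation is $\dfrac{T_{(n)} - T_{(1)}}{\sqrt{12}}$ (from the invariance property of MLE).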




Computation using R : 
(The minute component of the time has been considered as the data) 


R Code and Output :

> samp=c(15,21,14,23,29,17,15,18) # given sample
> n=length(samp) # size of the sample
> n
[1] 8
> MLE_theta1=min(samp) # MLE of the parameters
> MLE_theta2=max(samp)
> MLE_theta1
[1] 14
> MLE_theta2
[1] 29
> MLE_Mean=(MLE_theta1+MLE_theta2)/2 # MLE of mean
> cat("The mean arrival time is :",MLE_Mean,"minutes after 5pm.\n")
The mean arrival time is : 21.5 minutes after 5pm.
> MLE_Var=(MLE_theta2-MLE_theta1)/sqrt(12) # MLE of variability (standard deviation)
> MLE_Var
[1] 4.330127




Likelihood Equations and Related Discussions 


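$l_x(\theta) = \ln L_x(\theta)$ is called the log-likelihood function of $\theta$.

Likelihood Equation : $\dfrac{\partial l_x(\theta)}{\partial\theta} = 0$

An MLE that is an interior point of the parameter space, with $l_x$ differentiable there, is a root of the likelihood equation.

Any root may be a local minimum or a local maximum.

Possible verification for the root $\hat\theta$ to be an MLE: $\dfrac{\partial^2 l_x(\theta)}{\partial\theta^2}\Big|_{\theta = \hat\theta} < 0$.

If $\theta = (\theta_1, \ldots, \theta_s)'$, the likelihood equations are $\dfrac{\partial l_x}{\partial\theta_1} = 0, \ldots, \dfrac{\partial l_x}{\partial\theta_s} = 0$.

Possible verification for the root $\hat\theta$ to be an MLE of $\theta$: the Hessian matrix $\left(\left(\dfrac{\partial^2 l_x}{\partial\theta_i\,\partial\theta_j}\Big|_{\theta = \hat\theta}\right)\right)$ is negative definite.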




Result: Let T be a sufficient statistic for the family of distributions $\{f_\theta : \theta \in \Omega\}$. If a unique MLE of $\theta$ exists, it is a (nonconstant) function of T. If an MLE of $\theta$ exists but is not unique, one can find an MLE that is a function of T.

Proof. Since T is sufficient, from Neyman-Fisher factorisability we can write

$L(\theta) = f_\theta(x) = g_\theta(T(x))\,h(x)$

for all x, all $\theta$ and some h and $g_\theta$, where $L(\theta)$ is the likelihood function of $\theta$ and $f_\theta(x)$ is the joint pmf or pdf of the sample observations $x = (x_1, \ldots, x_n)$.




If $L(\theta)$ is maximised by a unique MLE $\hat\theta$, naturally it will also maximise the function $g_\theta(T(x))$.

$\Rightarrow \hat\theta$ should be a function of T.

If an MLE of $\theta$ exists but is not unique, then $\exists$ some MLE $\hat\theta$ that can be expressed as a function of T.


Proved 




Result: Under the regular estimation case (i.e. the situation where all the regularity conditions of the Cramer-Rao Inequality hold), if an estimator $\hat\theta$ of $\theta$ attains the Cramer-Rao Lower Bound (CRLB) for the variance, the likelihood equation has a unique solution $\hat\theta$ that maximises the likelihood function.

Proof. Let $L(\theta|x)$ denote the likelihood function of the real-valued parameter $\theta$ given the sample observations $x = (x_1, \ldots, x_n)$. If $f_\theta(x)$ denotes the joint pmf or pdf of x, from the equality condition in the CR inequality we get

$\dfrac{\partial}{\partial\theta} \log f_\theta(x) = k(\theta)(\hat\theta(x) - \theta)$




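that is

$\dfrac{\partial}{\partial\theta} \log L(\theta|x) = k(\theta)(\hat\theta(x) - \theta)$ (*)

with probability 1.

$\Rightarrow$ the likelihood equation $\dfrac{\partial}{\partial\theta} \log L(\theta|x) = 0$ has the unique solution $\theta = \hat\theta$.

Differentiating both sides of (*) again with respect to $\theta$ we get

$\dfrac{\partial^2}{\partial\theta^2} \log L(\theta|x) = k'(\theta)(\hat\theta - \theta) - k(\theta)$.

Hence

$\dfrac{\partial^2}{\partial\theta^2} \log L(\theta|x)\Big|_{\theta = \hat\theta} = -k(\hat\theta)$ (**)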




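Now, if T is an unbiased estimator of a real-valued estimable parametric function $g(\theta)$ which is differentiable at least once, from the CR regularity conditions we directly get

$\int T(x)\, f_\theta(x)\,dx = g(\theta)$

Or, differentiating both sides with respect to $\theta$ we get

$\int T(x)\, \dfrac{\partial}{\partial\theta} \log f_\theta(x)\, f_\theta(x)\,dx = g'(\theta)$

In particular choosing $g(\theta) = \theta$ and noting that $E_\theta\left[\dfrac{\partial}{\partial\theta} \log L(\theta|X)\right] = 0$ we get

$E_\theta\left[(T(X) - \theta)\, \dfrac{\partial}{\partial\theta} \log L(\theta|X)\right] = 1$.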




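Finally substituting

$T(X) - \theta = [k(\theta)]^{-1}\, \dfrac{\partial}{\partial\theta} \log L(\theta|X)$

(looking at (*), as T(X) is just $\hat\theta(X)$ by its definition) we get

$[k(\theta)]^{-1}\, E_\theta\left[\dfrac{\partial}{\partial\theta} \log L(\theta|X)\right]^2 = 1$

That is

$k(\theta) = E_\theta\left[\dfrac{\partial}{\partial\theta} \log L(\theta|X)\right]^2 > 0$ by regularity condition

Thus from (**) the S.O.C. for maximising $L(\theta)$ holds. Hence proved.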




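Result: Suppose $\dfrac{\partial^2 l_x(\theta)}{\partial\theta^2} < 0\ \forall\ \theta \in \Omega$. Then $\hat\theta$ satisfying $\dfrac{\partial l_x(\theta)}{\partial\theta}\Big|_{\theta = \hat\theta} = 0$ is the global maximum.

Proof. $l_x(\theta) = l_x(\hat\theta) + (\theta - \hat\theta)\, \dfrac{\partial l_x(\theta)}{\partial\theta}\Big|_{\theta = \hat\theta} + \dfrac{(\theta - \hat\theta)^2}{2}\, \dfrac{\partial^2 l_x(\theta)}{\partial\theta^2}\Big|_{\theta = \theta^{*}}$, where $\theta^{*}$ lies between $\theta$ and $\hat\theta$.

Note that the RHS $\le l_x(\hat\theta)$ because, in the RHS, the 2nd factor of the 2nd term vanishes and the 2nd factor of the third term is < 0. Hence proved.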




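Result: Suppose
(i) $\dfrac{\partial l_x(\theta)}{\partial\theta} = 0$ iff $\theta = \hat\theta$,
(ii) $\dfrac{\partial^2 l_x(\theta)}{\partial\theta^2}\Big|_{\theta = \hat\theta} < 0$, and
(iii) $\hat\theta$ is an interior point of an interval $I \subset \Omega$.

Then $\hat\theta$ is the global maximum.

Proof. If possible suppose $\theta^{*}$ is such that $l_x(\theta^{*}) > l_x(\hat\theta)$. Then there must be a local minimum in between the local maximum $\hat\theta$ and $\theta^{*}$. This means that for that minimum point also $\dfrac{\partial l_x(\theta)}{\partial\theta} = 0$, which is a contradiction to the supposition (i). Hence proved.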




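Ex.1 Let $X_1, X_2, \ldots, X_n \sim N(\mu, \sigma^2)$ independently.

Then $l_x(\theta) = \text{constant} - \dfrac{\sum_{i=1}^{n} (x_i - \mu)^2}{2\sigma^2} - \dfrac{n}{2}\ln\sigma^2$.

So $\dfrac{\partial l_x(\theta)}{\partial\mu} = 0$ and $\dfrac{\partial l_x(\theta)}{\partial\sigma^2} = 0$ imply $\hat\mu = \bar x$, $\hat\sigma^2 = \dfrac{1}{n}\sum_{i=1}^{n} (x_i - \bar x)^2 = s^2$.

Also check that $\dfrac{\partial^2 l_x(\theta)}{\partial\mu\,\partial\sigma^2}\Big|_{(\mu, \sigma^2) = (\bar x, s^2)} = 0$,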




\[ \frac{\partial^2 l_x(\theta)}{\partial\mu^2}\Big|_{(\mu,\sigma^2)=(\bar x, s^2)} = -\frac{n}{s^2} < 0, \qquad \frac{\partial^2 l_x(\theta)}{\partial(\sigma^2)^2}\Big|_{(\mu,\sigma^2)=(\bar x, s^2)} = -\frac{n}{2s^4} < 0. \]

So the Hessian matrix

\[ \begin{pmatrix} -\dfrac{n}{s^2} & 0 \\ 0 & -\dfrac{n}{2s^4} \end{pmatrix} \]

is negative definite. Hence $(\bar X, S^2)$ is the point of global maximum and is the MLE of $(\mu, \sigma^2)$.


Aliter: Let $\psi(x) = x - 1 - \ln x$, $x > 0$. Then
$\psi'(x) = 1 - 1/x$ and $\psi''(x) = 1/x^2 > 0$.

Therefore $\psi(x)$ is minimum at $x = 1$ and $\min \psi(x) = 1 - 1 - 0 = 0$. Based on this result we can write

\[ l_x(\bar x, s^2) - l_x(\mu, \sigma^2) = \frac{\sum_{i=1}^n (x_i - \mu)^2}{2\sigma^2} - \frac{n}{2} - \frac{n}{2}\ln\frac{s^2}{\sigma^2} \;\ge\; \frac{n}{2}\left[\frac{s^2}{\sigma^2} - 1 - \ln\frac{s^2}{\sigma^2}\right] \;\ge\; 0, \]

since $\sum_{i=1}^n (x_i - \mu)^2 = n s^2 + n(\bar x - \mu)^2 \ge n s^2$.
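The closed-form solution of Ex.1 can be checked numerically. The following is only a sketch on simulated (hypothetical) data: it compares $(\bar x, s^2)$ with a general-purpose maximiser and prints the eigenvalues of the Hessian given above. The sample size, seed and true parameter values are assumptions made for illustration.

```r
# Sketch: numerical check that (x-bar, s^2 with divisor n) maximises the
# normal loglikelihood. The sample is simulated, not taken from the notes.
set.seed(1)
x = rnorm(50, mean = 3, sd = 2)
n = length(x)

# Loglikelihood of (mu, sigma^2), up to an additive constant
loglik = function(mu, sigma2) {
  -sum((x - mu)^2) / (2 * sigma2) - (n / 2) * log(sigma2)
}

mu_hat = mean(x)
sigma2_hat = mean((x - mu_hat)^2)   # divisor n, not n - 1

# Numerical maximisation; log(sigma^2) is used so the search is unconstrained
fit = optim(c(0, 0), function(par) -loglik(par[1], exp(par[2])))
mu_num = fit$par[1]
sigma2_num = exp(fit$par[2])

# Hessian at the closed-form solution: diag(-n/s^2, -n/(2 s^4))
H = diag(c(-n / sigma2_hat, -n / (2 * sigma2_hat^2)))

print(c(mu_hat = mu_hat, mu_num = mu_num))
print(c(sigma2_hat = sigma2_hat, sigma2_num = sigma2_num))
print(eigen(H)$values)   # both eigenvalues are negative
```

The numerical and closed-form values should agree up to the tolerance of the optimiser.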


TRY YOURSELF!

M4.1. Let $X_1, X_2, \ldots, X_n \sim$ Lognormal$(\mu, \sigma^2)$ independently. Then the MLE of $(\mu, \sigma^2)$ is $(\bar Y, S^{*2})$ where $Y = \log X$ and $(\bar Y, S^{*2})$ = (sample mean, sample variance (with divisor n)) on $Y$.

Hint. If $X \sim$ Lognormal$(\mu, \sigma^2)$ then $Y = \log X \sim N(\mu, \sigma^2)$. Now proceed as in Ex. 1.

M4.2. (continuation) If in M4.1 $\mu = 0$, find the MLE of $\sigma^2$.

M4.3. Consider a random sample of size n from Exponential (mean = $\beta$). It is given only that k, $0 < k < n$, of these n observations are $\le M$, where M is a known positive number. Find the MLE of $\beta$.


TUTORIAL DISCUSSION :
Overview to the problems from MODULE 4 ...
M4.2. If $X \sim$ Lognormal$(0, \sigma^2)$ then $Y = \log X \sim N(0, \sigma^2)$.

Given $y = (y_1, \ldots, y_n)$, the loglikelihood function of $\sigma^2$ is
\[ l(\sigma^2) = \text{constant} - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n y_i^2 . \]
Now using the maxima-minima principle, from the F.O.C. we get that
\[ \sigma^2 = \frac{1}{n}\sum_{i=1}^n y_i^2 = \hat\sigma^2 \ \text{(say)} \]
is the only solution of the likelihood equation.


Also $\frac{\partial^2}{\partial(\sigma^2)^2}\, l(\sigma^2)\Big|_{\sigma^2=\hat\sigma^2} = -\frac{n}{2\hat\sigma^4} < 0$.

Moreover $\hat\sigma^2$ is an interior point of the parameter space $\Omega = (0, \infty)$.

$\Rightarrow$ $\hat\sigma^2$ is the point of global maximum of the likelihood function, i.e. the unique MLE of $\sigma^2$.


M4.3. Let $X_1, X_2, \ldots, X_n \sim$ Exponential (mean = $\beta$).
Define $Y_i = 1\ (0)$ if $X_i \le M$ $(X_i > M)$.
Then $Y_1, Y_2, \ldots, Y_n \sim \text{Bin}(1, \theta)$; $0 < \theta < 1$,

where $\theta = P_\beta[X_1 \le M] = 1 - \exp\left[-\frac{M}{\beta}\right]$. This means $\beta = -\frac{M}{\log(1-\theta)}$, a continuous function of $\theta$.

Now the MLE of $\theta$ is $\bar Y = \frac{1}{n}\sum_{i=1}^n Y_i = \frac{\#\{i : X_i \le M\}}{n} = \frac{k}{n}$ (see Application 4).

Hence by the invariance property the MLE of $\beta$ is $-\frac{M}{\log(1 - k/n)}$.
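The invariance argument of M4.3 can be illustrated as follows. This is a sketch on simulated (hypothetical) data; the values of `beta_true`, `M` and `n` are assumptions chosen for illustration. Only the count k is used by the estimator, and a direct numerical maximisation of the binomial loglikelihood in $\beta$ is shown for comparison.

```r
# Sketch for M4.3: only the count k of observations <= M is used,
# never the individual values.
set.seed(2)
beta_true = 2      # assumed true mean, for illustration only
M = 1.5            # known threshold
n = 5000
x = rexp(n, rate = 1 / beta_true)
k = sum(x <= M)

theta_hat = k / n                      # MLE of theta = P(X <= M)
beta_hat = -M / log(1 - theta_hat)     # MLE of beta by invariance

# Direct check: maximise the binomial loglikelihood written in terms of beta
loglik_beta = function(b) {
  th = 1 - exp(-M / b)
  k * log(th) + (n - k) * log(1 - th)
}
beta_num = optimize(loglik_beta, interval = c(0.01, 50), maximum = TRUE)$maximum

print(c(beta_hat = beta_hat, beta_num = beta_num))
```

Both routes should give the same value up to the tolerance of `optimize`, which is what the invariance property asserts.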


Multivariate Multiparameter MLE 
Module 5 


Saurav De 


Department of Statistics 
Presidency University 




Ex. (Multivariate multiparameter case) $X_1, X_2, \ldots, X_n \overset{iid}{\sim} N_p(\mu, \Sigma)$, a p-variate normal distribution.
The likelihood function of $\theta = (\mu, \Sigma)$ for given $X = (x_1, x_2, \ldots, x_n)$ is

\[ L(\mu, \Sigma) = \frac{1}{(2\pi)^{np/2}\,|\Sigma|^{n/2}} \exp\left[-\frac{1}{2}\sum_{\alpha=1}^n (x_\alpha - \mu)'\,\Sigma^{-1}(x_\alpha - \mu)\right]. \]


Now the exponent

\[ \sum_{\alpha=1}^n (x_\alpha - \mu)'\,\Sigma^{-1}(x_\alpha - \mu) = \sum_{\alpha=1}^n (x_\alpha - \bar x)'\,\Sigma^{-1}(x_\alpha - \bar x) + n(\bar x - \mu)'\,\Sigma^{-1}(\bar x - \mu). \]

The second term in the RHS is a positive definite (p.d.) quadratic form as $\Sigma$ (and hence $\Sigma^{-1}$) is a p.d. matrix. Hence the term will be minimum (i.e. 0) iff $\mu = \bar x$.
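For completeness, the decomposition of the exponent can be verified by writing $x_\alpha - \mu = (x_\alpha - \bar x) + (\bar x - \mu)$; the cross term drops out because $\sum_{\alpha}(x_\alpha - \bar x) = 0$. The step below is a standard expansion added as a worked equation, not taken from the slides:

```latex
\begin{aligned}
\sum_{\alpha=1}^n (x_\alpha - \mu)'\Sigma^{-1}(x_\alpha - \mu)
 &= \sum_{\alpha=1}^n (x_\alpha - \bar x)'\Sigma^{-1}(x_\alpha - \bar x)
  + 2(\bar x - \mu)'\Sigma^{-1}\sum_{\alpha=1}^n (x_\alpha - \bar x)
  + n(\bar x - \mu)'\Sigma^{-1}(\bar x - \mu) \\
 &= \sum_{\alpha=1}^n (x_\alpha - \bar x)'\Sigma^{-1}(x_\alpha - \bar x)
  + n(\bar x - \mu)'\Sigma^{-1}(\bar x - \mu).
\end{aligned}
```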


$\Rightarrow$ keeping $\Sigma$ fixed for the time being, $L(\mu, \Sigma)$ will be maximised at $\hat\mu = \bar x$.

Then

\[ L(\bar x, \Sigma) = \frac{1}{(2\pi)^{np/2}\,|\Sigma|^{n/2}} \exp\left[-\frac{1}{2}\sum_{\alpha=1}^n (x_\alpha - \bar x)'\,\Sigma^{-1}(x_\alpha - \bar x)\right]. \]

So the loglikelihood is

\[ \ell(\Sigma) = \text{constant} + \frac{n}{2}\log|\Sigma^{-1}S| - \frac{1}{2}\sum_{\alpha=1}^n \mathrm{tr}\left[(x_\alpha - \bar x)'\,\Sigma^{-1}(x_\alpha - \bar x)\right] \]

[as a scalar is the trace of itself]


where $S = \frac{1}{n}\sum_{\alpha=1}^n (x_\alpha - \bar x)(x_\alpha - \bar x)'$ is the sample variance-covariance matrix with divisor n.

(In the likelihood function the data $X$ are given, hence the matrix $S$ is also known and the term $\frac{n}{2}\log|S|$ is included in the constant term.)

This means

\[ \ell(\Sigma) = \text{constant} + \frac{n}{2}\log|\Sigma^{-1}S| - \frac{1}{2}\sum_{\alpha=1}^n \mathrm{tr}\left[\Sigma^{-1}(x_\alpha - \bar x)(x_\alpha - \bar x)'\right] \]

(as tr AB = tr BA). Or

\[ \ell(\Sigma) = \text{constant} + \frac{n}{2}\log|\Sigma^{-1}S| - \frac{1}{2}\,\mathrm{tr}\left(n\,\Sigma^{-1}S\right), \]

that is,

\[ \ell(A) = \text{constant} + \frac{n}{2}\log|A| - \frac{n}{2}\,\mathrm{tr}\,A \]
where $A = \Sigma^{-1}S$.


Let $\lambda_1, \ldots, \lambda_p$ be the p eigenvalues of $A$. As both $\Sigma$ and $S$ are p.d. matrices, $A = \Sigma^{-1}S$ has the same eigenvalues as the p.d. matrix $\Sigma^{-1/2} S\, \Sigma^{-1/2}$

$\Rightarrow$ all $\lambda_i > 0$. Also we know $|A| = \prod_{i=1}^p \lambda_i$ and $\mathrm{tr}\,A = \sum_{i=1}^p \lambda_i$. Using all these we get

\[ \ell(\Sigma) = \ell(A) = \ell(\lambda) = \text{constant} + \frac{n}{2}\sum_{i=1}^p \log\lambda_i - \frac{n}{2}\sum_{i=1}^p \lambda_i . \]


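Now

\[ \frac{\partial \ell(\lambda)}{\partial\lambda_i} = \frac{n}{2}\left(\frac{1}{\lambda_i} - 1\right) = 0 \ \Rightarrow\ \lambda_i = 1, \qquad \frac{\partial^2 \ell(\lambda)}{\partial\lambda_i^2}\Big|_{\lambda_i = 1} = -\frac{n}{2} < 0, \]
and
\[ \frac{\partial^2 \ell(\lambda)}{\partial\lambda_i\,\partial\lambda_j} = 0, \quad i \neq j. \]

Hence by the Maxima-Minima principle $\lambda_i = 1\ \forall\ i$ maximise the likelihood function.

Also $\lambda_i = 1\ \forall\ i \Rightarrow A = I$, the identity matrix

$\Rightarrow \Sigma^{-1}S = I \Rightarrow \Sigma = S\ (= \hat\Sigma,\ \text{say})$.

That is, at $\Sigma = S$ the likelihood is maximised.

Thus the MLE of $(\mu, \Sigma)$ is $(\bar X, S)$.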
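As an illustration of the result $(\hat\mu, \hat\Sigma) = (\bar X, S)$, the sketch below computes both estimates on simulated (hypothetical) data and compares the profile loglikelihood $\ell(A) = \frac{n}{2}\log|A| - \frac{n}{2}\mathrm{tr}\,A$ at $\Sigma = S$ with two other positive definite choices. The mixing matrix `A0`, the sample size and the function name `profile_loglik` are assumptions introduced for the example.

```r
# Sketch with simulated data: MLE of (mu, Sigma) for a p-variate normal sample.
set.seed(3)
n = 200; p = 3
A0 = matrix(c(1, 0.5, 0.2,
              0, 1,   0.3,
              0, 0,   1), nrow = p, byrow = TRUE)
X = matrix(rnorm(n * p), n, p) %*% A0      # rows are observations

mu_hat = colMeans(X)
Xc = sweep(X, 2, mu_hat)                   # centred data
S = crossprod(Xc) / n                      # divisor n (MLE), not n - 1

# Profile loglikelihood of Sigma up to a constant:
# l(A) = (n/2) log|A| - (n/2) tr(A), with A = Sigma^{-1} S
profile_loglik = function(Sigma) {
  A = solve(Sigma) %*% S
  logdet = as.numeric(determinant(A, logarithm = TRUE)$modulus)
  (n / 2) * logdet - (n / 2) * sum(diag(A))
}

l_at_S = profile_loglik(S)
l_alt1 = profile_loglik(S + 0.1 * diag(p))   # a nearby p.d. matrix
l_alt2 = profile_loglik(cov(X))              # divisor n - 1
print(c(l_at_S = l_at_S, l_alt1 = l_alt1, l_alt2 = l_alt2))
```

Note that `cov(X)` uses divisor n - 1, so it is close to but not equal to the MLE; its profile loglikelihood is slightly lower than at `S`.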


Application. (Bivariate Normal Distribution) This is a particular case of the above example with p = 2.

Let $X = (X_1, X_2)' \sim N_2(\mu, \Sigma)$ where

\[ \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} \quad \text{and} \quad \Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix} \]

and $\rho = \mathrm{Corr}(X_1, X_2)$.


Based on a random sample of size n on $(X_1, X_2)$ the MLE of $(\mu, \Sigma)$ will be

\[ \bar X = \begin{pmatrix} \bar X_1 \\ \bar X_2 \end{pmatrix}, \qquad S = \frac{1}{n}\sum_{\alpha=1}^n (X_\alpha - \bar X)(X_\alpha - \bar X)' = \begin{pmatrix} S_1^2 & S_{12} \\ S_{12} & S_2^2 \end{pmatrix} \]

where $S_1^2 = \frac{1}{n}\sum_{\alpha=1}^n (X_{1\alpha} - \bar X_1)^2$ is the sample variance on the random variable $X_1$ etc. and $S_{12} = \frac{1}{n}\sum_{\alpha=1}^n (X_{1\alpha} - \bar X_1)(X_{2\alpha} - \bar X_2)$ is the sample covariance on $(X_1, X_2)$.
Hence the MLE of $(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$ is $(\bar X_1, \bar X_2, S_1^2, S_2^2, r)$ where $r = \frac{S_{12}}{S_1 S_2}$ = sample correlation coefficient of $(X_1, X_2)$.


Application. (Bivariate Normal Distribution with reparameterization technique)

Let $(X, Y)$ have a bivariate normal distribution with parameters $\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho$. Suppose that n observations are made on $(X, Y)$, and $N - n$ observations on $X$ only; i.e. $N - n$ observations on $Y$ are missing. Find the MLEs of $\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho$.

We know that for $(X, Y) \sim BVN(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$,
$X \sim N(\mu_1, \sigma_1^2)$ and $Y|X = x \sim N\left(\mu_2 + \rho\frac{\sigma_2}{\sigma_1}(x - \mu_1),\ \sigma_2^2(1 - \rho^2)\right)$.

W.l.g. denote $(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$ by $\theta$ for simplicity.


Define $\alpha = \mu_2 - \rho\frac{\sigma_2}{\sigma_1}\mu_1$, $\beta = \rho\frac{\sigma_2}{\sigma_1}$ and $\gamma^2 = \sigma_2^2(1 - \rho^2)$ ... (*)

$\Rightarrow Y|X = x \sim N(\alpha + \beta x, \gamma^2)$. Also we know

joint density of $(X, Y)$ = marginal density of $X$ $\times$ conditional density of $Y|X$.

Thus the likelihood of $\theta$ based on the n paired observations on $(X, Y)$ and $N - n$ additional observations on $X$ alone will be

\[ L(\theta) = \text{Const} \times \frac{1}{(\sigma_1^2)^{N/2}} \exp\left[-\frac{1}{2\sigma_1^2}\sum_{j=1}^N (x_j - \mu_1)^2\right] \times \frac{1}{(\gamma^2)^{n/2}} \exp\left[-\frac{1}{2\gamma^2}\sum_{j=1}^n (y_j - \alpha - \beta x_j)^2\right]. \]


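Let $\ell = \ln L(\theta)$. Then the likelihood equations w.r.t. $\mu_1, \sigma_1^2, \alpha, \beta$ and $\gamma^2$ will be

\[ \frac{\partial \ell}{\partial\mu_1} = 0 \ \Rightarrow\ \hat\mu_1 = \bar x_N = \frac{1}{N}\sum_{j=1}^N x_j \quad \ldots (1) \]
\[ \frac{\partial \ell}{\partial\sigma_1^2} = 0 \ \Rightarrow\ \hat\sigma_1^2 = \frac{1}{N}\sum_{j=1}^N (x_j - \bar x_N)^2 \quad \ldots (2) \]
\[ \frac{\partial \ell}{\partial\alpha} = 0 \ \Rightarrow\ \hat\alpha + \hat\beta\,\bar x_n = \bar y \quad \ldots (3) \]
\[ \frac{\partial \ell}{\partial\beta} = 0 \ \Rightarrow\ \hat\alpha\sum_{j=1}^n x_j + \hat\beta\sum_{j=1}^n x_j^2 = \sum_{j=1}^n x_j y_j \quad \ldots (4) \]
\[ \frac{\partial \ell}{\partial\gamma^2} = 0 \ \Rightarrow\ \hat\gamma^2 = \frac{1}{n}\sum_{j=1}^n (y_j - \hat\alpha - \hat\beta x_j)^2 \quad \ldots (5) \]

where $\bar x_n$ and $\bar y$ are the means of the n paired observations.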


Equations (1) and (2) directly provide the MLE of $\mu_1$ and $\sigma_1^2$. Solutions of (3) and (4) can be expressed as

\[ \hat\beta = \frac{s_{12n}}{s_{1n}^2} \quad \text{and} \quad \hat\alpha = \bar y - \hat\beta\,\bar x_n \]

where $s_{12n} = \frac{1}{n}\sum_{j=1}^n x_j y_j - \bar x_n \bar y$ and $s_{1n}^2 = \frac{1}{n}\sum_{j=1}^n (x_j - \bar x_n)^2$. Also, on simplification, (5) gives

\[ \hat\gamma^2 = s_2^2 - \hat\beta^2 s_{1n}^2 \quad \text{where} \quad s_2^2 = \frac{1}{n}\sum_{j=1}^n (y_j - \bar y)^2. \]

Finally, using the relations as in (*) we get $\hat\mu_2 = \hat\alpha + \hat\beta\hat\mu_1$, $\hat\sigma_2^2 = \hat\gamma^2 + \hat\beta^2\hat\sigma_1^2$ and $\hat\rho = \hat\beta\,\frac{\hat\sigma_1}{\hat\sigma_2}$ as the MLEs of $\mu_2$, $\sigma_2^2$ and $\rho$.


Application of MLE under BVN distribution using R

Let $(X, Y)$ have a BVN distribution with parameters $\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho$. Suppose that 10 observations are made on $(X, Y)$, and additional 5 observations on $X$ only; i.e. 5 observations on $Y$ are missing. These are as follows:

x: 2.95 2.52 1.99 1.40 0.11 1.88 1.46 1.69 2.44 2.60
y: 4.03 2.25 2.01 1.73 1.59 3.60 2.07 2.82 1.12 3.19
x: 1.88 2.48 1.77 3.34 2.75

Find the MLEs of $\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho$.


Computation using R :
R Code :

x=c(2.95,2.52,1.99,1.40,0.11,1.88,1.46,1.69,2.44,
    2.60,1.88,2.48,1.77,3.34,2.75)
y=c(4.03,2.25,2.01,1.73,1.59,3.60,2.07,2.82,1.12,
    3.19,NA,NA,NA,NA,NA)
mat=cbind(x,y) # sample observations
n=10 # number of paired observations (X,Y)
N=15 # total number of observations
sum1=0
for(i in 1:N)
  sum1=sum1+x[i]
MLE_mu1=sum1/N # MLE of mu1
sum2=0
for(i in 1:N)
  sum2=sum2+((x[i]-MLE_mu1)^2)
MLE_sigma1_2=sum2/N # MLE of sigma1^2
xn_bar=mean(x[1:n]) # mean of the paired x values
y_bar=mean(na.omit(y))
sum3=0
for(i in 1:n)
  sum3=sum3+(x[i]*y[i])
s12n=(sum3/n)-(xn_bar*y_bar)
sum4=0
for(i in 1:n)
  sum4=sum4+((x[i]-xn_bar)^2)
s1n_2=sum4/n
beta=s12n/s1n_2 # calculation of beta_hat
alpha=y_bar-(beta*xn_bar) # calculation of alpha_hat
sum5=0
for(i in 1:n)
  sum5=sum5+((y[i]-y_bar)^2)
s2_2=sum5/n
gamma_hat_2=s2_2-((beta^2)*s1n_2)
MLE_mu2=alpha+(beta*MLE_mu1) # MLE of mu2
MLE_sigma2_2=gamma_hat_2+((beta^2)*MLE_sigma1_2) # MLE of sigma2^2
MLE_rho=beta*(sqrt(MLE_sigma1_2)/sqrt(MLE_sigma2_2)) # MLE of rho


Computation using R :
Output :

> MLE_mu1
[1] 2.084
> MLE_mu2
[1] 2.539358
> MLE_sigma1_2
[1] 0.5740507
> MLE_sigma2_2
[1] 0.7839372
> MLE_rho
[1] 0.4675964
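The same estimators can also be written without explicit loops. The following is a sketch of one possible vectorised arrangement; `bvn_missing_mle` is a hypothetical helper name introduced here, not part of the original listing. It takes all N values of x (paired values first) and the n observed values of y.

```r
# Sketch: the estimators of equations (1)-(5) as one vectorised function.
# bvn_missing_mle is a hypothetical helper name, not from the source listing.
bvn_missing_mle = function(x_all, y) {
  n = length(y)                      # paired observations are x_all[1:n] and y
  xn = x_all[1:n]
  mu1 = mean(x_all)                  # equation (1)
  sigma1_2 = mean((x_all - mu1)^2)   # equation (2)
  s12n = mean(xn * y) - mean(xn) * mean(y)
  s1n_2 = mean((xn - mean(xn))^2)
  beta = s12n / s1n_2
  alpha = mean(y) - beta * mean(xn)
  gamma2 = mean((y - mean(y))^2) - beta^2 * s1n_2
  sigma2_2 = gamma2 + beta^2 * sigma1_2
  c(mu1 = mu1, mu2 = alpha + beta * mu1,
    sigma1_2 = sigma1_2, sigma2_2 = sigma2_2,
    rho = beta * sqrt(sigma1_2 / sigma2_2))
}

x_all = c(2.95, 2.52, 1.99, 1.40, 0.11, 1.88, 1.46, 1.69, 2.44, 2.60,
          1.88, 2.48, 1.77, 3.34, 2.75)
y = c(4.03, 2.25, 2.01, 1.73, 1.59, 3.60, 2.07, 2.82, 1.12, 3.19)
est = bvn_missing_mle(x_all, y)
print(round(est, 4))
```

Applied to the data of this example, it should reproduce the values reported in the output above.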


Ex. (Multivariate multiparameter case) Let $X_1, X_2, \ldots, X_n \overset{iid}{\sim} MN(m; p_1, p_2, \ldots, p_k)$, $0 < p_i < 1$, $\sum_{i=1}^k p_i < 1$, a k-variate multinomial distribution.

Assuming m, the multinomial index, as known, the likelihood function of $p = (p_1, p_2, \ldots, p_k)$ is

\[ L(p) = \text{Constant} \times p_1^{\sum_{\alpha=1}^n x_{1\alpha}} \cdots p_k^{\sum_{\alpha=1}^n x_{k\alpha}} \left(1 - \sum_{i=1}^k p_i\right)^{mn - \sum_{\alpha=1}^n\sum_{i=1}^k x_{i\alpha}}. \]


$\Rightarrow$ the loglikelihood of p is

\[ \ell(p) = \text{Constant} + \sum_{\alpha=1}^n x_{1\alpha}\log p_1 + \cdots + \sum_{\alpha=1}^n x_{k\alpha}\log p_k + \left(mn - \sum_{\alpha=1}^n\sum_{i=1}^k x_{i\alpha}\right)\log\left(1 - \sum_{i=1}^k p_i\right). \]

Or

\[ \frac{\partial}{\partial p_i}\ell(p) = \frac{\sum_{\alpha=1}^n x_{i\alpha}}{p_i} - \frac{mn - \sum_{\alpha=1}^n\sum_{j=1}^k x_{j\alpha}}{1 - \sum_{j=1}^k p_j}. \]


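Hence the likelihood equations: $\frac{\partial}{\partial p_i}\ell(p) = 0\ \forall\ i$

\[ \Rightarrow\ \frac{\sum_{\alpha=1}^n x_{i\alpha}}{p_i} = \frac{mn - \sum_{\alpha=1}^n\sum_{j=1}^k x_{j\alpha}}{1 - \sum_{j=1}^k p_j}\ \ \forall\ i \quad \ldots (**) \]

\[ \Rightarrow\ 1 - \sum_{i=1}^k p_i = \frac{mn - \sum_{\alpha=1}^n\sum_{i=1}^k x_{i\alpha}}{mn} \quad \text{(on simplification by the addendo-dividendo method)} \]

Substituting this form in the likelihood equations (**) we get

\[ p_i = \frac{\sum_{\alpha=1}^n x_{i\alpha}}{mn} = \frac{\bar x_i}{m} = \hat p_i\ \text{(say)}\ \ \forall\ i \]

where $\bar x_i = \frac{1}{n}\sum_{\alpha=1}^n x_{i\alpha}$ = sample mean of the ith component variable.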


Again, on simplification

\[ \frac{\partial^2}{\partial p_i^2}\ell(p)\Big|_{p=\hat p} = -\frac{nm^2}{\bar x_i} - \frac{nm^2}{m - \sum_{j=1}^k \bar x_j} < 0\ \ \forall\ i \]

And

\[ \frac{\partial^2}{\partial p_i\,\partial p_j}\ell(p)\Big|_{p=\hat p} = -\frac{nm^2}{m - \sum_{l=1}^k \bar x_l} < 0\ \ \forall\ i, j;\ i \neq j. \]


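Now the Hessian matrix is

\[ H = \left(\left(\frac{\partial^2 \ell(p)}{\partial p_i\,\partial p_j}\right)\right)_{p=\hat p} = -\begin{pmatrix} a_1 + b & b & \cdots & b \\ b & a_2 + b & \cdots & b \\ \vdots & \vdots & \ddots & \vdots \\ b & b & \cdots & a_k + b \end{pmatrix} \]

where $a_i = \frac{nm^2}{\bar x_i} > 0$ and $b = \frac{nm^2}{m - \sum_{i=1}^k \bar x_i} > 0$.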


Claim: H is negative definite.

Verification. Define

\[ H^* = -H = \begin{pmatrix} a_1 & 0 & \cdots & 0 \\ 0 & a_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_k \end{pmatrix} + b\,\mathbf{1}\mathbf{1}' . \]

Then
\[ y'H^*y = \sum_{i=1}^k a_i y_i^2 + b\left(\sum_{i=1}^k y_i\right)^2 \ge 0, \ \text{with "=" iff } y = 0. \]

$\Rightarrow$ $H^*$ is positive definite $\Rightarrow$ $H$ is negative definite.

$\Rightarrow$ $\{\hat p_i,\ i = 1, \ldots, k\}$ maximises the likelihood function and hence is the MLE of p.
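The multinomial result $\hat p_i = \bar x_i / m$ and the negative definiteness of H can be checked on simulated (hypothetical) data. In the sketch below the values of `m`, `p_true` and `n` are assumptions made for illustration; the last (remainder) cell of the multinomial draw is dropped so that the data match the k-variate form used above.

```r
# Sketch with simulated data: multinomial MLE and the Hessian at p_hat.
set.seed(5)
m = 20
p_true = c(0.2, 0.3, 0.1)          # k = 3 cells; the remainder has probability 0.4
n = 500
# rmultinom returns a (k+1) x n matrix; the last row is the remainder cell
counts = rmultinom(n, size = m, prob = c(p_true, 1 - sum(p_true)))
X = t(counts[1:3, ])               # n x k data matrix

xbar = colMeans(X)
p_hat = xbar / m                   # MLE: p_hat_i = xbar_i / m

# Hessian at p_hat: H = -(diag(a) + b 11')
a = n * m^2 / xbar
b = n * m^2 / (m - sum(xbar))
H = -(diag(a) + b * matrix(1, 3, 3))

print(p_hat)
print(eigen(H)$values)             # all negative
```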


TRY YOURSELF!

M5.1. Let $X_1, X_2, \ldots, X_n$ be a random sample of size n drawn from a continuous p-variate distribution with common joint density

\[ f(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}\,\prod_{i=1}^p x_i} \exp\left[-\frac{1}{2}(\ln x - \mu)'\,\Sigma^{-1}(\ln x - \mu)\right], \quad x > 0,\ \Sigma > 0 \]
\[ \phantom{f(x)} = 0 \quad \text{otherwise} \]

where $x > 0$ means $x_i > 0\ \forall\ i$, $\ln x = (\ln x_1, \ldots, \ln x_p)'$, $\Sigma > 0$ means $\Sigma$ is a $p \times p$ symmetric p.d. matrix and $\mu \in \mathbb{R}^p$.

Find the MLE of $(\mu, \Sigma)$.


M5.2. Let $X_1, \ldots, X_n$ be a random sample of size n drawn from a multinomial population $MN(m; p_1, p_2, p_3, p_2, p_1)$ such that $2p_1 + 2p_2 + p_3 < 1$. Find the ML estimator of $(p_1, p_2, p_3)'$.


TUTORIAL DISCUSSION :
Overview to the problems from MODULE 5 ...

M5.1. The given distribution is called the multivariate (p-variate) lognormal distribution with parameters $(\mu, \Sigma)$. This is usually denoted by $LN_p(\mu, \Sigma)$.

As in the case of the univariate distribution, in the multivariate case also there is a relation between the multivariate lognormal and multivariate normal distributions. This is as follows.

Let $X = (X_1, \ldots, X_p)' \sim LN_p(\mu, \Sigma)$.

Then $\ln X = (\ln X_1, \ldots, \ln X_p)' \sim N_p(\mu, \Sigma)$.

The proof of this result is straightforward.


Make a transformation from $(X_1, \ldots, X_p) \to (Y_1, \ldots, Y_p)$ where $Y_i = \ln X_i$.

Now using the Jacobian of the transformation or otherwise, get the joint p.d.f. of $Y = (Y_1, \ldots, Y_p)'$. You will find that

\[ Y \sim N_p(\mu, \Sigma). \]

Following this result we can say that in problem M5.1
\[ Y_1, Y_2, \ldots, Y_n \overset{iid}{\sim} N_p(\mu, \Sigma) \]
where $Y_\alpha = \ln X_\alpha$, $\alpha = 1, \ldots, n$.

Now following the ML estimation under the multivariate normal distribution, we get that the MLE of $(\mu, \Sigma)$ is $(\bar Y, S_Y)$ where $\bar Y$ = sample mean vector and $S_Y$ = sample variance-covariance matrix (with divisor n) on $Y$.


M5.2. Here $X_\alpha = (X_{1\alpha}, X_{2\alpha}, X_{3\alpha}, X_{4\alpha}, X_{5\alpha})'$, $\alpha = 1, \ldots, n$.

The likelihood of $(p_1, p_2, p_3)$ is

\[ L(p_1, p_2, p_3) = \text{Const} \times p_1^{\sum_{\alpha=1}^n (x_{1\alpha} + x_{5\alpha})}\; p_2^{\sum_{\alpha=1}^n (x_{2\alpha} + x_{4\alpha})}\; p_3^{\sum_{\alpha=1}^n x_{3\alpha}} \times (1 - 2p_1 - 2p_2 - p_3)^{\,mn - \sum_{\alpha=1}^n\sum_{i=1}^5 x_{i\alpha}}. \]

With $\ell(p) = \ln L(p_1, p_2, p_3)$,

\[ \frac{\partial}{\partial p_1}\ell(p) = \frac{\sum_{\alpha=1}^n (x_{1\alpha} + x_{5\alpha})}{p_1} - \frac{2\left(mn - \sum_{\alpha=1}^n\sum_{i=1}^5 x_{i\alpha}\right)}{1 - 2p_1 - 2p_2 - p_3}, \]

\[ \frac{\partial}{\partial p_2}\ell(p) = \frac{\sum_{\alpha=1}^n (x_{2\alpha} + x_{4\alpha})}{p_2} - \frac{2\left(mn - \sum_{\alpha=1}^n\sum_{i=1}^5 x_{i\alpha}\right)}{1 - 2p_1 - 2p_2 - p_3} \]

and

\[ \frac{\partial}{\partial p_3}\ell(p) = \frac{\sum_{\alpha=1}^n x_{3\alpha}}{p_3} - \frac{mn - \sum_{\alpha=1}^n\sum_{i=1}^5 x_{i\alpha}}{1 - 2p_1 - 2p_2 - p_3}. \]

Hence $\frac{\partial}{\partial p_i}\ell(p) = 0\ \forall\ i \Rightarrow$

\[ \frac{2p_1}{1 - 2p_1 - 2p_2 - p_3} = \frac{\sum_{\alpha=1}^n (x_{1\alpha} + x_{5\alpha})}{mn - \sum_{\alpha=1}^n\sum_{i=1}^5 x_{i\alpha}}, \qquad \frac{2p_2}{1 - 2p_1 - 2p_2 - p_3} = \frac{\sum_{\alpha=1}^n (x_{2\alpha} + x_{4\alpha})}{mn - \sum_{\alpha=1}^n\sum_{i=1}^5 x_{i\alpha}} \]

And

\[ \frac{p_3}{1 - 2p_1 - 2p_2 - p_3} = \frac{\sum_{\alpha=1}^n x_{3\alpha}}{mn - \sum_{\alpha=1}^n\sum_{i=1}^5 x_{i\alpha}}. \]

Denote these three equations by (*).
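Now, adding the three equations, (*) $\Rightarrow$

\[ 1 - 2p_1 - 2p_2 - p_3 = \frac{mn - \sum_{\alpha=1}^n\sum_{i=1}^5 x_{i\alpha}}{mn}. \]

Substituting this in (*) $\Rightarrow$

\[ p_1 = \frac{1}{2mn}\sum_{\alpha=1}^n (x_{1\alpha} + x_{5\alpha}) = \frac{1}{2m}(\bar x_1 + \bar x_5) = \hat p_1, \]
\[ p_2 = \frac{1}{2mn}\sum_{\alpha=1}^n (x_{2\alpha} + x_{4\alpha}) = \frac{1}{2m}(\bar x_2 + \bar x_4) = \hat p_2, \]

And

\[ p_3 = \frac{1}{mn}\sum_{\alpha=1}^n x_{3\alpha} = \frac{\bar x_3}{m} = \hat p_3. \]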


Now find the Hessian matrix; it is not difficult to show that the Hessian matrix is n.d. at $(p_1, p_2, p_3) = (\hat p_1, \hat p_2, \hat p_3)$
$\Rightarrow$ the likelihood is maximum at $(\hat p_1, \hat p_2, \hat p_3)$

$\Rightarrow$ $(\hat p_1, \hat p_2, \hat p_3)$ : MLE of $(p_1, p_2, p_3)$.


One Parameter Exponential Family and MLE 
Module 6 


Saurav De 


Department of Statistics 
Presidency University 




One Parameter Exponential Family (OPEF) 


Suppose, based on a random sample of size n, the joint pmf (or pdf) can be expressed as $p_\theta(x) = \exp[Q(\theta)T(x) + c(\theta) + D(x)]$, $\theta$ real valued.
Assumptions:
1. The first two derivatives of $Q(\theta)$ and $c(\theta)$ exist and are continuous.
2. $I(\theta) = E_\theta\left[\frac{\partial}{\partial\theta}\log L\right]^2$ exists and is positive,

where L is the likelihood function of $\theta$ based on the random sample.

Then $\mathcal{P} = \{p_\theta(x) : \theta \in \Omega\}$ is called OPEF.


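Check that $E_\theta T(X) = -\frac{c'(\theta)}{Q'(\theta)}$.

Hint. Here $p_\theta(x) = \exp[Q(\theta)T(x) + c(\theta) + D(x)]$.

\[ \begin{aligned} E_\theta[Q'(\theta)T(X) + c'(\theta)] &= \int \{Q'(\theta)T(x) + c'(\theta)\}\, p_\theta(x)\,dx \\ &= \int \frac{\partial}{\partial\theta}\exp[Q(\theta)T(x) + c(\theta) + D(x)]\,dx \\ &= \frac{\partial}{\partial\theta}\int p_\theta(x)\,dx \quad \left[\text{as } \int \text{ and } \tfrac{\partial}{\partial\theta} \text{ are interchangeable}\right] \\ &= \frac{\partial}{\partial\theta}(1) = 0 \end{aligned} \]

\[ \Rightarrow\ E_\theta T(X) = -\frac{c'(\theta)}{Q'(\theta)}. \]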


\[ V_\theta T(X) = \frac{\dfrac{Q''(\theta)\,c'(\theta)}{Q'(\theta)} - c''(\theta)}{(Q'(\theta))^2}. \]

Hint. $\frac{\partial^2}{\partial\theta^2}\int p_\theta(x)\,dx = \frac{\partial^2}{\partial\theta^2}(1) = 0$

Or $\int \frac{\partial^2}{\partial\theta^2}\, p_\theta(x)\,dx = 0$ (as $\int$ and $\frac{\partial^2}{\partial\theta^2}$ are interchangeable)

\[ \Rightarrow\ \int \frac{\partial}{\partial\theta}\left[\{Q'(\theta)T(x) + c'(\theta)\}\, p_\theta(x)\right] dx = 0 \]
\[ \Rightarrow\ \int \{Q''(\theta)T(x) + c''(\theta)\}\, p_\theta(x)\,dx + \int \{Q'(\theta)T(x) + c'(\theta)\}^2\, p_\theta(x)\,dx = 0 \]

Or
\[ E_\theta[Q''(\theta)T(X) + c''(\theta)] + E_\theta[Q'(\theta)T(X) + c'(\theta)]^2 = 0 \]
Or
\[ Q''(\theta)\,E_\theta(T(X)) + c''(\theta) + (Q'(\theta))^2\, E_\theta\left[T(X) - E_\theta T(X)\right]^2 = 0, \]
since $c'(\theta) = -Q'(\theta)\,E_\theta T(X)$.

Note that the last term in the LHS = $(Q'(\theta))^2\, V_\theta(T(X))$.

Hence get $V_\theta(T(X))$.


The likelihood equation for the probability model under OPEF is

\[ \frac{\partial}{\partial\theta}\ln L = c'(\theta) + Q'(\theta)T(x) = 0 \]

Or $T(x) = -\frac{c'(\theta)}{Q'(\theta)}$.

Note. In particular, for n = 1 the pmf or pdf $f_\theta(x) \in$ OPEF if
\[ f_\theta(x) = \exp[Q(\theta)T^*(x) + \nu(\theta) + h(x)], \quad \theta \in \Omega\ (\subset \mathbb{R}) \]

satisfying the above-mentioned assumptions. Then

\[ E_\theta T^*(X) = -\frac{\nu'(\theta)}{Q'(\theta)}. \]


With this form of common density the joint pdf of n independent random sample observations will naturally be

\[ p_\theta(x) = \exp\left[Q(\theta)\sum_{i=1}^n T^*(x_i) + n\,\nu(\theta) + \sum_{i=1}^n h(x_i)\right]. \]

Without loss of generality, it can be expressed as

\[ p_\theta(x) = \exp[Q(\theta)T(x) + c(\theta) + D(x)] \]
with $T(x) = \sum_{i=1}^n T^*(x_i)$, $c(\theta) = n\,\nu(\theta)$ and $D(x) = \sum_{i=1}^n h(x_i)$.
Clearly $p_\theta(x) \in$ OPEF.


Result. The method of moments and the method of maximum likelihood agree with each other for OPEF distributions.

Let the common pdf (or pmf) be $f_\theta(x) = \exp[Q(\theta)T^*(x) + \nu(\theta) + h(x)] \in$ OPEF. Then, based on a random sample of size n, the loglikelihood function is

\[ l_x(\theta) = Q(\theta)\sum_{i=1}^n T^*(x_i) + n\,\nu(\theta) + D(x). \]

Now
\[ \frac{\partial}{\partial\theta}\, l_x(\theta) = 0 \ \Rightarrow\ \frac{1}{n}\sum_{i=1}^n T^*(x_i) = -\frac{\nu'(\theta)}{Q'(\theta)} = E_\theta T^*(X) \quad \ldots (*) \]

But (*) is the moment equation with respect to the random variable $T^*(X)$. Hence proved.
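As a concrete instance of this result, the Poisson distribution can be written in OPEF form with $Q(\theta) = \log\theta$, $T^*(x) = x$, $\nu(\theta) = -\theta$ and $h(x) = -\log x!$. The sketch below, on simulated (hypothetical) data, solves the likelihood equation $\frac{1}{n}\sum T^*(x_i) = -\nu'(\theta)/Q'(\theta)$ numerically and compares the root with the sample mean and with a direct maximiser. The Poisson choice and all numerical settings are assumptions made for illustration.

```r
# Sketch: Poisson(theta) in OPEF form
# f(x) = exp[Q(theta) T*(x) + nu(theta) + h(x)] with
# Q(theta) = log(theta), T*(x) = x, nu(theta) = -theta, h(x) = -log(x!).
set.seed(6)
x = rpois(300, lambda = 4)

Q_prime  = function(theta) 1 / theta
nu_prime = function(theta) -1

# Likelihood equation = moment equation: mean(T*(x)) = -nu'(theta) / Q'(theta)
score = function(theta) mean(x) + nu_prime(theta) / Q_prime(theta)
theta_root = uniroot(score, interval = c(0.01, 100))$root

# Direct maximisation of the loglikelihood for comparison
loglik = function(theta) sum(dpois(x, theta, log = TRUE))
theta_max = optimize(loglik, interval = c(0.01, 100), maximum = TRUE)$maximum

print(c(mean_x = mean(x), theta_root = theta_root, theta_max = theta_max))
```

All three printed values should coincide up to numerical tolerance, since the moment estimator and the MLE are the same here.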


Result. For any distribution in OPEF

(i) any solution of the likelihood equation provides a maximum of the likelihood function.

(ii) a solution of the likelihood equation, if it exists, is unique.

(i) and (ii) $\Rightarrow$ the solution of the likelihood equation is the unique MLE.

Proof. (i) Let $\hat\theta$ be a solution of the likelihood equation. Then

\[ c'(\hat\theta) + Q'(\hat\theta)T(x) = 0 \ \Rightarrow\ -\frac{c'(\hat\theta)}{Q'(\hat\theta)} = T(x). \]


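Now
\[ -I(\theta) = E_\theta\left(\frac{\partial^2}{\partial\theta^2}\, l_x(\theta)\right) = c''(\theta) + Q''(\theta)\,E_\theta(T(X)) < 0\ \ \forall\ \theta, \ \text{as } I(\theta) > 0 \]

\[ \Rightarrow\ c''(\theta) - Q''(\theta)\,\frac{c'(\theta)}{Q'(\theta)} < 0\ \ \forall\ \theta. \]

But
\[ \frac{\partial^2}{\partial\theta^2}\, l_x(\theta)\Big|_{\theta=\hat\theta} = c''(\hat\theta) + Q''(\hat\theta)\,T(x) = c''(\hat\theta) - Q''(\hat\theta)\,\frac{c'(\hat\theta)}{Q'(\hat\theta)} \]

\[ \Rightarrow\ \frac{\partial^2}{\partial\theta^2}\, l_x(\theta)\Big|_{\theta=\hat\theta} < 0. \]
So $\hat\theta$ maximises the likelihood function of $\theta$.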


(ii) If possible, suppose $\exists$ another solution $\tilde\theta$ of the likelihood equation.

(i) $\Rightarrow$ $\tilde\theta$ also maximises the likelihood function, like $\hat\theta$.

$\Rightarrow$ $\exists$ another solution of the likelihood equation in between $\hat\theta$ and $\tilde\theta$ which minimises the likelihood function.

$\Rightarrow$ contradiction to (i). Hence our supposition is wrong. In other words, the solution of the likelihood equation, if it exists, is only one, i.e. unique.


Notes.
1. • We know that for OPEF distributions, $T(X)$ is the complete sufficient statistic.
   • Also here the MLE is the unique solution of $T(x) = -\frac{c'(\theta)}{Q'(\theta)}$.

   $\Rightarrow$ the MLE and the complete sufficient statistic $T(X)$ are in 1 : 1 relation.

   But we know by the Rao-Blackwell-Lehmann-Scheffe Theorem that the MVUE is a function of the complete sufficient statistic.

   $\Rightarrow$ under OPEF the MVUE can be obtained from the MLE just by bias correction.


2. We get $\frac{\partial^2}{\partial\theta^2}\log L\Big|_{\theta=\hat\theta} = -I(\hat\theta)$ under OPEF,
   i.e. $I(\theta)\Big|_{\theta=\hat\theta} = -\frac{\partial^2}{\partial\theta^2}\log L\Big|_{\theta=\hat\theta}$.

   This helps evaluate $I(\theta)$ avoiding mathematical expectation. In fact the equivalence of the moment equation and the likelihood equation under OPEF is responsible for this.

   We illustrate this interesting matter through the next example.
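Ex. $X_1, \ldots, X_n \overset{iid}{\sim} N(0, \theta)$

\[ \log L = \text{const} - \frac{n}{2}\log\theta - \frac{1}{2\theta}\sum_{i=1}^n x_i^2 . \]

\[ \frac{\partial}{\partial\theta}\log L = -\frac{n}{2\theta} + \frac{1}{2\theta^2}\sum_{i=1}^n x_i^2 = 0 \ \Rightarrow\ \hat\theta = \frac{1}{n}\sum_{i=1}^n x_i^2 \]

where $\hat\theta$ is the unique MLE of $\theta$.

\[ \frac{\partial^2}{\partial\theta^2}\log L = \frac{n}{2\theta^2} - \frac{1}{\theta^3}\sum_{i=1}^n x_i^2 = \frac{n}{\theta^2}\left[\frac{1}{2} - \frac{\hat\theta}{\theta}\right] \]

\[ \Rightarrow\ I(\hat\theta) = -\frac{\partial^2}{\partial\theta^2}\log L\Big|_{\theta=\hat\theta} = -\frac{n}{\hat\theta^2}\left[\frac{1}{2} - 1\right] = \frac{n}{2\hat\theta^2}, \]

so that $I(\theta) = \frac{n}{2\theta^2}$.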
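The identity of Note 2 for the $N(0, \theta)$ example can be illustrated numerically. The sketch below uses simulated (hypothetical) data and a central finite difference for the second derivative; `theta_true`, the sample size and the step `h` are assumptions chosen for illustration.

```r
# Sketch: minus the second derivative of log L at theta_hat versus n / (2 theta_hat^2)
set.seed(7)
theta_true = 2                      # assumed variance, for illustration only
x = rnorm(400, mean = 0, sd = sqrt(theta_true))
n = length(x)

loglik = function(theta) -(n / 2) * log(theta) - sum(x^2) / (2 * theta)
theta_hat = mean(x^2)               # unique MLE of theta

# Second derivative at theta_hat by central differences
h = 1e-4
d2 = (loglik(theta_hat + h) - 2 * loglik(theta_hat) + loglik(theta_hat - h)) / h^2

print(c(minus_d2 = -d2, n_over_2theta2 = n / (2 * theta_hat^2)))
```

The two printed numbers should agree up to finite-difference error, without any expectation having been evaluated.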


3. Let there exist an unbiased estimator $T(X)$ of $g(\theta)$ with variance attaining the CRLB. Then $p_\theta(x)$ is of the exponential form and

\[ \frac{\partial}{\partial\theta}\log L = k(\theta)\,(T(X) - g(\theta)) \]

   $\Rightarrow$ the unique MLE of $g(\theta)$ is $T(X)$.

4. Our discussion can be extended directly to the Multi-parameter Exponential Family (MPEF). The results are very similar to those obtained under OPEF. For detailed study consult A First Course on Parametric Inference by B. K. Kale.


Power Series Distribution Family

Let $X \sim$ a discrete probability distribution with p.m.f. $f_\theta$ of the form

\[ f_\theta(x) = \frac{a_x\,\theta^x}{g(\theta)} \quad \text{if } x = 0, 1, \ldots \]
\[ \phantom{f_\theta(x)} = 0 \quad \text{otherwise} \]

where $\theta > 0$, $a_x$ is a positive real-valued function of x and $g(\theta)$ is a positive real-valued function of $\theta$. Any discrete probability distribution of this form is called a Power Series Distribution. The corresponding family is called the Power Series Distribution family.


As $f_\theta(x)$ is a p.m.f. at the point x,

\[ \sum_{x \ge 0} f_\theta(x) = 1 \ \Rightarrow\ g(\theta) = \sum_{x \ge 0} a_x\,\theta^x, \quad \theta > 0. \]

Now

\[ E(X) = \sum_{x \ge 0} x\,\frac{a_x\,\theta^x}{g(\theta)} = \frac{\theta}{g(\theta)}\sum_{x \ge 0} x\,a_x\,\theta^{x-1} = \frac{\theta}{g(\theta)}\,\frac{\partial}{\partial\theta}\,g(\theta) = \theta\,\frac{\partial}{\partial\theta}\ln g(\theta). \]

If $X_1, \ldots, X_n$ are n random observations on X, then using the method of moments we get the moment equation

\[ E(X) = \bar X, \ \text{the sample mean} \]

Or

\[ \theta\,\frac{\partial}{\partial\theta}\ln g(\theta) = \bar x \quad \ldots (*) \]


On the other hand the likelihood function of $\theta$ is

\[ L(\theta) = \frac{\left(\prod_{i=1}^n a_{x_i}\right)\theta^{\sum_{i=1}^n x_i}}{(g(\theta))^n}. \]

This implies the loglikelihood function of $\theta$ is

\[ \ell(\theta) = \text{const} + \sum_{i=1}^n x_i \ln\theta - n\ln g(\theta). \]

Now

\[ \frac{\partial}{\partial\theta}\ell(\theta) = \frac{\sum_{i=1}^n x_i}{\theta} - n\,\frac{\partial}{\partial\theta}\ln g(\theta) \]

Or

\[ \frac{\partial}{\partial\theta}\ell(\theta) = \frac{n\bar x}{\theta} - n\,\frac{\partial}{\partial\theta}\ln g(\theta). \]

Hence the likelihood equation is

\[ \frac{\partial}{\partial\theta}\ell(\theta) = 0 \ \Rightarrow\ \theta\,\frac{\partial}{\partial\theta}\ln g(\theta) = \bar x \quad \ldots (**) \]

(*) and (**) are the same $\Rightarrow$ the method of moments and the method of maximum likelihood coincide for power series distributions.


Note. Depending on the choices of $a_x$, $\theta$ and $g(\theta)$, we get a family of power series distributions.

One of the well-known members of this family is the Bernoulli distribution [for the choice $a_x = 1$ for $x = 0, 1$, $\theta = \frac{p}{1-p}$ and $g(\theta) = (1 - p)^{-1} = (1 + \theta)$]. For this distribution

\[ \ln g(\theta) = \ln(1 + \theta) \ \Rightarrow\ \frac{\partial}{\partial\theta}\ln g(\theta) = (1 + \theta)^{-1} = (1 - p). \]


Hence the likelihood equation $\theta\,\frac{\partial}{\partial\theta}\ln g(\theta) = \bar x$ becomes
\[ \frac{p}{1-p}\,(1 - p) = \bar x \iff p = \bar x. \]

This is the solution of the likelihood equation as well as of the moment equation. In fact $\hat p = \bar X$ is the MLE as well as the MME (Method of Moments Estimator) of p under the Bernoulli(p) distribution.

Similarly the Poisson($\lambda$) distribution is another member, with the choice $a_x = \frac{1}{x!}$, $\theta = \lambda$ and $g(\theta) = \exp[\theta]$.

Here also we can verify in a similar way that $\bar X$ is the MLE as well as the MME of $\lambda$.
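The common equation $\theta\,\frac{\partial}{\partial\theta}\ln g(\theta) = \bar x$ can be solved numerically for any member of the family once $\ln g$ is supplied. The sketch below is illustrative only: `psd_mle` is a hypothetical helper name, the derivative of $\ln g$ is approximated by a central difference, and the two data sets are simulated. It covers the Poisson case ($g(\theta) = e^\theta$) and a geometric case ($a_x = 1$, $g(\theta) = (1-\theta)^{-1}$), for which the closed-form solutions are $\bar x$ and $\bar x/(1 + \bar x)$.

```r
# Sketch: solve theta * d/dtheta log g(theta) = xbar for a power series family.
# psd_mle is a hypothetical helper; log_g is supplied by the user.
psd_mle = function(xbar, log_g, interval) {
  eq = function(theta) {
    h = 1e-6
    dlogg = (log_g(theta + h) - log_g(theta - h)) / (2 * h)  # numerical derivative
    theta * dlogg - xbar
  }
  uniroot(eq, interval = interval, tol = 1e-10)$root
}

set.seed(8)
x_pois = rpois(500, lambda = 3)
x_geom = rgeom(500, prob = 0.4)   # pmf (1 - theta) theta^x with theta = 0.6

# Poisson: g(theta) = exp(theta), so log g(theta) = theta
theta_pois = psd_mle(mean(x_pois), function(t) t, c(1e-3, 50))
# Geometric: g(theta) = 1 / (1 - theta), so log g(theta) = -log(1 - theta)
theta_geom = psd_mle(mean(x_geom), function(t) -log(1 - t), c(1e-3, 1 - 1e-3))

print(c(theta_pois = theta_pois, closed_form = mean(x_pois)))
print(c(theta_geom = theta_geom, closed_form = mean(x_geom) / (1 + mean(x_geom))))
```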


Polynomial Type Exponential Distribution and MLE

A random variable X has a polynomial type exponential distribution if its density is defined as

\[ f_\theta(x) = \exp\left[-\sum_{i=0}^m \theta_i x^i\right]; \quad x > 0 \]

where the exponent is a polynomial in x of degree at most m and any one parameter, say $\theta_0$, is a function of the remaining parameters.

As

\[ \int \exp[-\theta_0 - \theta_1 x - \theta_2 x^2 - \ldots - \theta_m x^m]\,dx = 1 \ \Rightarrow\ \exp[\theta_0] = \int \exp\left[-\sum_{i=1}^m \theta_i x^i\right] dx. \]


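Then the rth population raw moment is

\[ \begin{aligned} \mu_r' = E(X^r) &= \int x^r \exp\left[-\sum_{i=0}^m \theta_i x^i\right] dx \\ &= -\exp[-\theta_0]\int \frac{\partial}{\partial\theta_r}\exp\left[-\sum_{i=1}^m \theta_i x^i\right] dx \\ &= -\exp[-\theta_0]\,\frac{\partial}{\partial\theta_r}\int \exp\left[-\sum_{i=1}^m \theta_i x^i\right] dx \\ &= -\frac{\partial}{\partial\theta_r}\ln\int \exp\left[-\sum_{i=1}^m \theta_i x^i\right] dx. \end{aligned} \]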


Let $X_1, \ldots, X_n$ be drawn from the above distribution. Then, using the Method of Moments, the m moment equations are

\[ \mu_r' = m_r' \left(= \frac{1}{n}\sum_{\alpha=1}^n x_\alpha^r\right), \quad r = 1, 2, \ldots, m. \]

That is

\[ -\frac{\partial}{\partial\theta_r}\ln\int \exp\left[-\sum_{i=1}^m \theta_i x^i\right] dx = m_r', \quad r = 1, 2, \ldots, m \quad \ldots (*) \]

On the other hand the likelihood function of $\theta = (\theta_1, \ldots, \theta_m)$ is

\[ L(\theta) = \exp\{-n\theta_0\}\exp\left[-\sum_{i=1}^m \theta_i \sum_{\alpha=1}^n x_\alpha^i\right], \]

i.e.

\[ L(\theta) = \exp\{-n\theta_0\}\exp\left[-n\sum_{i=1}^m \theta_i\, m_i'\right]. \]


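Now
\[ \frac{\partial}{\partial\theta_r}L(\theta) = \left(-n\,\frac{\partial\theta_0}{\partial\theta_r}\right)\exp\{-n\theta_0\}\exp\left[-n\sum_{i=1}^m \theta_i m_i'\right] + \exp\{-n\theta_0\}\,(-n\,m_r')\exp\left[-n\sum_{i=1}^m \theta_i m_i'\right] \]

\[ \frac{\partial}{\partial\theta_r}L(\theta) = n\left(-\frac{\partial\theta_0}{\partial\theta_r} - m_r'\right)\exp\{-n\theta_0\}\exp\left[-n\sum_{i=1}^m \theta_i m_i'\right] \]

\[ \phantom{\frac{\partial}{\partial\theta_r}L(\theta)} = n\left(-\frac{\partial}{\partial\theta_r}\ln\int \exp\left[-\sum_{i=1}^m \theta_i x^i\right] dx - m_r'\right)\exp\{-n\theta_0\}\exp\left[-n\sum_{i=1}^m \theta_i m_i'\right], \quad r = 1, 2, \ldots, m. \]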
Hence the likelihood equations are
\[ \frac{\partial}{\partial\theta_r}L(\theta) = 0, \quad r = 1, 2, \ldots, m \]
\[ \iff\ -\frac{\partial}{\partial\theta_r}\ln\int \exp\left[-\sum_{i=1}^m \theta_i x^i\right] dx = m_r', \quad r = 1, 2, \ldots, m \quad \ldots (**) \]

[as $\exp\{-n\theta_0\}\exp\left[-n\sum_{i=1}^m \theta_i m_i'\right] > 0$]

Since (*) and (**) are identical, for this distribution also the method of moments and the method of maximum likelihood agree with each other.


TRY YOURSELF !

M6.1. Let $X_1, \ldots, X_n$ independently follow the negative binomial distribution with common pmf

\[ f_\theta(x) = \binom{x + r - 1}{x}(1 - \theta)^r\,\theta^x; \quad x = 0, 1, 2, \ldots;\ 0 < \theta < 1. \]
Show that the distribution $\in$ OPEF and hence find the MLE of $\theta$.

Also get the Fisher's Information for $\theta$.


TUTORIAL DISCUSSION :
Overview to the problems from MODULE 6 ...

M6.1. Choosing $a_x = \binom{x + r - 1}{x}$ and $g(\theta) = (1 - \theta)^{-r}$, the pmf of the given negative binomial distribution becomes $f_\theta(x) = \frac{a_x\,\theta^x}{g(\theta)}$, which is a power series distribution and hence $\in$ OPEF.

Also here $\theta\,\frac{\partial}{\partial\theta}\ln g(\theta) = \frac{r\theta}{1 - \theta}$.

So from the equation $\theta\,\frac{\partial}{\partial\theta}\ln g(\theta) = \bar x$ (which is the moment as well as the likelihood equation) we get $\theta = \frac{\bar x}{r + \bar x}$.

$\Rightarrow$ MLE of $\theta$ : $\frac{\bar X}{r + \bar X} = \hat\theta$ (say)


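Check that here the loglikelihood of $\theta$ is

\[ \ell(\theta) = \text{Const} + nr\ln(1 - \theta) + \sum_{i=1}^n x_i \ln\theta. \]

Then it is easy to verify that

\[ -\frac{\partial^2}{\partial\theta^2}\ell(\theta) = \frac{nr}{(1 - \theta)^2} + \frac{n\bar x}{\theta^2}, \quad \text{with } \bar x = \frac{r\hat\theta}{1 - \hat\theta}. \]

So Fisher's Information
\[ I(\theta) = -\frac{\partial^2}{\partial\theta^2}\ell(\theta)\Big|_{\hat\theta = \theta} = \frac{nr}{\theta(1 - \theta)^2} \quad \text{(on simplification)} \]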
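The M6.1 answers can be checked on simulated (hypothetical) data. In R, `rnbinom(n, size = r, prob = 1 - theta)` has pmf proportional to $(1-\theta)^r\theta^x$, matching the parameterisation above; the values of `r`, `theta_true` and `n` below are assumptions for illustration.

```r
# Sketch: negative binomial MLE and information at theta_hat
set.seed(9)
r = 5; theta_true = 0.3; n = 1000
x = rnbinom(n, size = r, prob = 1 - theta_true)

theta_hat = mean(x) / (r + mean(x))          # closed-form MLE

# Direct numerical maximisation for comparison
loglik = function(theta) sum(dnbinom(x, size = r, prob = 1 - theta, log = TRUE))
theta_num = optimize(loglik, c(1e-4, 1 - 1e-4), maximum = TRUE)$maximum

# Minus second derivative at theta_hat versus n r / (theta (1 - theta)^2)
info_obs = n * r / (1 - theta_hat)^2 + sum(x) / theta_hat^2
info_formula = n * r / (theta_hat * (1 - theta_hat)^2)

print(c(theta_hat = theta_hat, theta_num = theta_num))
print(c(info_obs = info_obs, info_formula = info_formula))
```

The two information values agree exactly (up to rounding) at $\hat\theta$, which is the shortcut of Note 2 in action.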


Some Important Theorems on MLE 
Module 7 


Saurav De 


Department of Statistics 
Presidency University 




Theorem 1.
Assumptions
(i) $X_1, X_2, \ldots, X_n$ are iid $\sim$ p.d.f. $f_\theta(x)$.

(ii) The probability distributions $\{P_\theta : \theta \in \Omega\}$ are distinct [i.e. there doesn't exist any $\theta_1 \neq \theta_2$ such that $P_{\theta_1} = P_{\theta_2}$].

(iii) The probability distributions $\{P_\theta : \theta \in \Omega\}$ have common support [for example, this is not applicable for the $R(0, \theta)$ distribution because there $P_1 = R(0, 1)$ and $P_2 = R(0, 2)$ don't have common support].

Then $P_{\theta_0}[L_x(\theta_0) > L_x(\theta)] \to 1$ as $n \to \infty$ ($\theta \neq \theta_0$ is fixed).


Proof. L,(0) = I xi) 


fo (x; 
So k( -Xn a TD 


Now 1X "In E = > Eg ln ex (by WLLN since X1, Xo, . . . , X, are iid) 
imi 


< InEg, 5 A Xi (by Jensen's inequality) 
= /Inl =0 


So Po, {Lx(0) < Lx(A0)} — 1 as n — oo. Hence proved. 




Theorem 2. 

Assumptions: (i) - (iii) [same as in Theorem 1.]

(iv) $\Omega$ is open.

(v) $L_x(\theta)$ is differentiable everywhere in $\Omega$ for almost all $x$.

Then the likelihood equation $\frac{\partial}{\partial\theta} l_x(\theta) = 0$ has a root which is consistent for $\theta$, with probability tending to 1.




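Proof. Let $\theta_0$ be the true value of $\theta$.

As $\Omega$ is open, $\exists\ \delta > 0$ such that $(\theta_0 - \delta, \theta_0 + \delta) \subset \Omega$.
Define $S_n = \{x : l_x(\theta_0) > l_x(\theta_0 - \delta)$ and $l_x(\theta_0) > l_x(\theta_0 + \delta)\}$

$\Rightarrow P_{\theta_0}(S_n) \geq P_{\theta_0}[l_x(\theta_0) > l_x(\theta_0 - \delta)] + P_{\theta_0}[l_x(\theta_0) > l_x(\theta_0 + \delta)] - 1$
[using $P(A \cap B) \geq P(A) + P(B) - 1$]
$\to 1$ as $n \to \infty$ (by Theorem 1) ... (§)

So given $x \in S_n$, $\exists\ \hat\theta_n$, a root of the likelihood equation, such that $\hat\theta_n \in (\theta_0 - \delta, \theta_0 + \delta)$ [a differentiable $l_x$ that is larger at $\theta_0$ than at both end points attains a local maximum inside the interval].

So

$l_x(\theta_0) > l_x(\theta_0 - \delta)$ and $l_x(\theta_0) > l_x(\theta_0 + \delta) \Rightarrow \hat\theta_n \in (\theta_0 - \delta, \theta_0 + \delta)$
or
$P_{\theta_0}(S_n) \leq P_{\theta_0}\{|\hat\theta_n - \theta_0| < \delta\}$

So (§) $\Rightarrow \hat\theta_n \xrightarrow{P} \theta_0$ with probability tending to 1.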


Proved 




Implication of Theorem 2: With probability tending to 1 as $n \to \infty$, $\exists$ a local maximum $\hat\theta_n$ of $l_x(\theta)$ which is consistent for $\theta$, and hence $l'_x(\hat\theta_n) = 0$.




Result: If the likelihood equation has a unique root $\hat\theta_n$, then with probability tending to 1 as $n \to \infty$, $\hat\theta_n$ is the MLE of $\theta$ and it is consistent.

Proof. By Theorem 2, $\hat\theta_n$ is a local maximum of $l_x(\theta)$. If possible, suppose $\exists$ a $\theta^*$ such that $l_x(\hat\theta_n) < l_x(\theta^*)$. But $\hat\theta_n$ is a local maximum of $l_x(\theta)$. Hence $\exists$ a point $\theta^{**}$ between $\hat\theta_n$ and $\theta^*$ at which $l_x(\theta)$ is locally minimum.

Hence $\frac{\partial}{\partial\theta} l_x(\theta)|_{\theta = \theta^{**}} = 0$.

So $l'_x(\theta) = 0$ has another solution at $\theta = \theta^{**}$, which is a contradiction as $\hat\theta_n$ is the unique root.

$\Rightarrow l_x(\hat\theta_n) \geq l_x(\theta^*)\ \forall\ \theta^*$. That means $\hat\theta_n$ is the global maximum.




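Theorem 3. (asymptotic normality)
Assumptions: (i) - (iv) [same as in Theorem 2.]
(v) $f_\theta(x)$ is thrice differentiable with respect to $\theta$ and the third derivative is continuous.
(vi) $\int f_\theta(x)\,dx$ can be twice differentiated under the integral sign.
(vii) For every $\theta_0 \in \Omega$, $\exists\ \delta > 0$ and $M(x)$ such that
$\left|\frac{\partial^3}{\partial\theta^3} \ln f_\theta(x)\right| \leq M(x)\ \forall\ \theta \in (\theta_0 - \delta, \theta_0 + \delta)\ \forall\ x$
with $E_{\theta_0} M(X) < \infty$.

(viii) $0 < i(\theta) = E_\theta\left(-\frac{\partial^2}{\partial\theta^2} \ln f_\theta(X)\right) = E_\theta\left(\frac{\partial}{\partial\theta} \ln f_\theta(X)\right)^2 < \infty$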




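Then any consistent solution $\hat\theta_n$ of the likelihood equation satisfies

$\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{D} N\left(0, \frac{1}{i(\theta_0)}\right)$

Proof. From Taylor's expansion of $l'_x(\theta) = \frac{\partial}{\partial\theta} l_x(\theta)$ about $\theta_0$ we get

$l'_x(\hat\theta_n) = l'_x(\theta_0) + (\hat\theta_n - \theta_0)\, l''_x(\theta_0) + \frac{(\hat\theta_n - \theta_0)^2}{2}\, l'''_x(\theta^*)$ ... (*)

[$\theta^*$ lies between $\hat\theta_n$ and $\theta_0$]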




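As $\hat\theta_n$ is a solution of the likelihood equation, $l'_x(\hat\theta_n) = 0$.

So (*) $\Rightarrow$

$\sqrt{n}(\hat\theta_n - \theta_0) = \frac{\frac{1}{\sqrt{n}}\, l'_x(\theta_0)}{-\frac{1}{n}\, l''_x(\theta_0) - \frac{1}{2n}(\hat\theta_n - \theta_0)\, l'''_x(\theta^*)} = \frac{N}{D_1 + D_2}$, say.

Now

$N = \frac{1}{\sqrt{n}}\, l'_x(\theta_0) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \frac{\partial}{\partial\theta} \ln f_\theta(x_i)|_{\theta = \theta_0}$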




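As the $X_i$'s are iid, so also are $\frac{\partial}{\partial\theta} \ln f_\theta(X_i)|_{\theta = \theta_0}$.

Moreover note that $E_{\theta_0}\left(\frac{\partial}{\partial\theta} \ln f_\theta(X_1)|_{\theta = \theta_0}\right) = 0$ and

$V_{\theta_0}\left(\frac{\partial}{\partial\theta} \ln f_\theta(X_1)|_{\theta = \theta_0}\right) = E_{\theta_0}\left(\frac{\partial}{\partial\theta} \ln f_\theta(X_1)|_{\theta = \theta_0}\right)^2 = i(\theta_0) > 0$.

$\Rightarrow$ by the Lindeberg-Levy CLT [if $Y_1, Y_2, \ldots$ is a sequence of iid random variables with common mean $\mu$ and common variance $\sigma^2$, then $\sqrt{n}(\bar{Y}_n - \mu) \xrightarrow{D} N(0, \sigma^2)$, where $\bar{Y}_n = \frac{1}{n}\sum_{i=1}^n Y_i$]

$N = \frac{1}{\sqrt{n}} \sum_{i=1}^n \frac{\partial}{\partial\theta} \ln f_\theta(X_i)|_{\theta = \theta_0} \xrightarrow{D} N(0, i(\theta_0))$ as $n \to \infty$ ... (**)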


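Next $D_1 = -\frac{1}{n}\, l''_x(\theta_0) = -\frac{1}{n} \sum_{i=1}^n \frac{\partial^2}{\partial\theta^2} \ln f_\theta(X_i)|_{\theta = \theta_0} \xrightarrow{P} i(\theta_0)$ (by Khinchin's WLLN)

Also

$\left|\frac{1}{n}\, l'''_x(\theta^*)\right| \leq \frac{1}{n} \sum_{i=1}^n \left|\frac{\partial^3}{\partial\theta^3} \ln f_\theta(X_i)|_{\theta = \theta^*}\right| \leq \frac{1}{n} \sum_{i=1}^n M(X_i)$ (by condition (vii))
$\xrightarrow{P} E_{\theta_0} M(X)$ (by Khinchin's WLLN)
$< \infty$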


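And $\hat\theta_n - \theta_0 \xrightarrow{P} 0$ as $\hat\theta_n$ is consistent for $\theta_0$.

$\Rightarrow D_2 = -\frac{(\hat\theta_n - \theta_0)}{2} \cdot \frac{1}{n}\, l'''_x(\theta^*) \xrightarrow{P} 0$ ... (***)

The limit of $D_1$ and (***) $\Rightarrow D_1 + D_2 \xrightarrow{P} i(\theta_0)$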




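$\Rightarrow \frac{N}{D_1 + D_2} \xrightarrow{D} N\left(0, \frac{1}{i(\theta_0)}\right)$ under $\theta_0$ [as by Slutsky's theorem

$X_n \xrightarrow{D} X$ and $Y_n \xrightarrow{P} c \Rightarrow \frac{X_n}{Y_n} \xrightarrow{D} \frac{X}{c}$, provided $Y_n$ and $c$ are non-zero]

That is, $\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{D} N\left(0, \frac{1}{i(\theta_0)}\right)$ ... (C)

Proved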




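Note: Any estimator satisfying (C), i.e. $\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{D} N\left(0, \frac{1}{i(\theta_0)}\right)$, is called an Efficient Likelihood Estimator (ELE).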


Note: If the Likelihood equation has unique root, then the MLE is 
asymptotically efficient. 


Again if the probability of multiroot for likelihood equation tends to 0, 
then also with probability tending to 1, the unique MLE is consistent and 
asymptotically efficient. 




Note: The conditions under Theorem 3 are sufficient but not necessary for 
the consistency and asymptotic normality of a solution of the likelihood 
equation. 

There are examples where not all the conditions are satisfied simultaneously; still a solution of the likelihood equation is consistent and asymptotically normal.

One such example is as follows: 




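Ex. Let $X_1, \ldots, X_n \overset{iid}{\sim} N(0, \theta)$, $\theta > 0$.

The likelihood function of $\theta$ will be

$L_x(\theta) = (2\pi\theta)^{-n/2} \exp\left(-\frac{1}{2\theta}\sum_{i=1}^n x_i^2\right)$

Then it is easy to check that the solution of the likelihood equation is

$\hat\theta_n = \frac{1}{n}\sum_{i=1}^n X_i^2$

Also $\frac{X_i^2}{\theta} \sim \chi^2_1 \Rightarrow E(X_i^2) = \theta$, $V(X_i^2) = 2\theta^2$.

Hence from the Lindeberg-Levy CLT we can write, as $n \to \infty$,

$\frac{\sum X_i^2 - n\theta}{\sqrt{2n\theta^2}} \xrightarrow{D} N(0, 1)$,

i.e. $\frac{\sqrt{n}(\hat\theta_n - \theta)}{\sqrt{2\theta^2}} \xrightarrow{D} N(0, 1)$,

or $\sqrt{n}(\hat\theta_n - \theta) \xrightarrow{D} N(0, 2\theta^2)$.

Again $i(\theta) = E\left[-\frac{\partial^2}{\partial\theta^2} \log f_\theta(X)\right] = \frac{1}{2\theta^2}$.

Hence asymptotic normality of the solution of the likelihood equation holds.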
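
The limiting variance $2\theta^2 = 1/i(\theta)$ in this example can be checked by simulation. The sketch below is only illustrative: the true value $\theta = 2$, the sample size and the number of replications are assumed values, not taken from the lecture.

```r
set.seed(42)
theta <- 2      # assumed true variance parameter
n     <- 400    # sample size per replication
reps  <- 4000   # number of Monte Carlo replications

# sqrt(n) * (theta_hat - theta), with theta_hat = mean of the squared observations
z <- replicate(reps, {
  x <- rnorm(n, mean = 0, sd = sqrt(theta))
  sqrt(n) * (mean(x^2) - theta)
})

# empirical mean and variance against the asymptotic variance 2 * theta^2
c(mean = mean(z), variance = var(z), target = 2 * theta^2)
```

A histogram of `z` (for example `hist(z, freq = FALSE)`) can be compared with the $N(0, 2\theta^2)$ density to see the normal approximation.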




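Moreover we note that here, as $n \to \infty$,

$\hat\theta_n \xrightarrow{P} \theta$,
hence $\hat\theta_n$, the solution of the likelihood equation, is consistent too.
But $\frac{\partial^3}{\partial\theta^3} \log f_\theta(x) = -\frac{1}{\theta^3} + \frac{3x^2}{\theta^4} \to \infty$ as $\theta \to 0$ (for $x \neq 0$).

$\Rightarrow \frac{\partial^3}{\partial\theta^3} \log f_\theta(x)$ is not bounded in $0 < \theta < \infty$. And hence condition (vii) does not hold here.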




The following theorem is a modification over Theorem 3 and it is so 
modified that the above cases are also covered. 


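Theorem 4. Suppose all the conditions of Theorem 3 remain unchanged except Condition (vii), which is now as stated below:

(vii) For every $\theta_0 \in \Omega$, $\exists\ \delta > 0$, a positive and twice differentiable function $g(\theta)$ and $M(x)$ such that

$\left|\frac{\partial}{\partial\theta}\left[g(\theta)\, \frac{\partial^2}{\partial\theta^2} \ln f_\theta(x)\right]\right| \leq M(x)\ \forall\ \theta \in (\theta_0 - \delta, \theta_0 + \delta)\ \forall\ x$
with $E_{\theta_0} M(X) < \infty$.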


Note that under g(0) = 1 above condition becomes identical with that of 
Theorem 3. 




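Proof. See p. 500 of the book Mathematical Methods of Statistics by H. Cramér, and G. Kulldorff (1957), On the conditions for consistency and asymptotic efficiency of maximum likelihood estimates, Skand. Aktuarietidskr. 40, 129-144.

In the case of the above example, if we choose $g(\theta) = \theta^3$, which is a positive and twice differentiable function of $\theta$, then $g(\theta)\, \frac{\partial^2}{\partial\theta^2} \ln f_\theta(x) = \frac{\theta}{2} - x^2$, so that $\left|\frac{\partial}{\partial\theta}\left[g(\theta)\, \frac{\partial^2}{\partial\theta^2} \ln f_\theta(x)\right]\right| = \frac{1}{2}$. Taking $M(x) = 1 + x^2$, Condition (vii) above is satisfied with

$E_{\theta_0} M(X) = 1 + \theta_0 < \infty$.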




Note: The uniqueness of MLE does not ensure its asymptotic normality. 


Ex. Let $X_1, \ldots, X_n$ be a random sample from the $R(0, \theta)$ distribution. Then $X_{(n)}$ is the unique MLE of $\theta$. But the asymptotic distribution of $X_{(n)}$ is by no means normal.
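
To see why the limit is not normal, a short worked derivation may help. It uses only the distribution function of the sample maximum, $P_\theta[X_{(n)} \leq y] = (y/\theta)^n$ for $0 \leq y \leq \theta$; the normalisation by $n$ (rather than $\sqrt{n}$) is the standard one for this example and is supplied here as an illustration.

```latex
\begin{align*}
P_\theta\!\left[\frac{n(\theta - X_{(n)})}{\theta} > t\right]
  &= P_\theta\!\left[X_{(n)} < \theta\left(1 - \frac{t}{n}\right)\right]
   = \left(1 - \frac{t}{n}\right)^{n}
   \;\longrightarrow\; e^{-t}, \qquad 0 < t < n .
\end{align*}
```

So $n(\theta - X_{(n)})/\theta$ converges in distribution to a standard exponential variable; the MLE here converges at rate $n$ and its limit law is skewed, not normal. Note that the common-support assumption (iii) fails for this family.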




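Ex.3 $f_\theta(x) = \exp[\theta T(x) + c(\theta) + D(x)]$.

Let $\hat\theta_n$ be a solution of the likelihood equation
$E_\theta T(X) = \frac{1}{n}\sum_{i=1}^n T(X_i) = \bar{T}$, say.

$i(\theta) = E_\theta\left(-\frac{\partial^2}{\partial\theta^2} \ln f_\theta(X)\right) = -c''(\theta) = V_\theta T(X)$

So $\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{D} N\left(0, \frac{1}{V_{\theta_0} T(X)}\right)$.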




MLE based on Iteration 


Some Typical Examples 
Module 8 


Saurav De 


Department of Statistics 
Presidency University 


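Ex. (Cauchy$(\theta, 1)$ Distribution)

Let $X_1, X_2, \ldots, X_n \overset{iid}{\sim}$ Cauchy$(\theta, 1)$; $\theta$ = population median (location parameter) and scale = 1. Then the density is

$f_\theta(x) = \frac{1}{\pi[1 + (x - \theta)^2]}$.

Then the likelihood of $\theta$ is

$L_x(\theta) = \frac{1}{\pi^n} \prod_{i=1}^n \frac{1}{1 + (x_i - \theta)^2}$

$\Rightarrow$ the loglikelihood is: $l_x(\theta) = \text{const} - \sum_{i=1}^n \log\{1 + (x_i - \theta)^2\}$

$\Rightarrow l'_x(\theta) = 2\sum_{i=1}^n \frac{x_i - \theta}{1 + (x_i - \theta)^2}$.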
i=1 




$\Rightarrow$ the likelihood equation is

$\sum_{i=1}^n \frac{x_i - \theta}{1 + (x_i - \theta)^2} = 0$ ... (*)

In particular for $n = 2$, (*) $\Rightarrow$ a cubic equation;
for $n = 3$, (*) $\Rightarrow$ a 5-ic equation etc.

In general (*) $\Rightarrow$ a $(2n - 1)$-ic equation: practically impossible to solve for $\theta$ explicitly even for $n = 3$ or 4.

Hence we call for an Iterative Method: a method to solve (*) for $\theta$, i.e. a method to get the MLE of $\theta$.




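Iterative Method: Denote the MLE of $\theta$ by $\hat\theta_n$.

To find $\hat\theta_n$ such that $l'(\hat\theta_n) = 0$.
Initial approximation: $\hat\theta_{0n}$.
Then from Taylor's expansion we can have
$l'(\hat\theta_n) \approx l'(\hat\theta_{0n}) + (\hat\theta_n - \hat\theta_{0n})\, l''(\hat\theta_{0n})$,

provided $\hat\theta_{0n}$ is chosen in a close neighbourhood of $\hat\theta_n$.

$\Rightarrow \hat\theta_n \approx \hat\theta_{0n} - \frac{l'(\hat\theta_{0n})}{l''(\hat\theta_{0n})}$ (as $l'(\hat\theta_n) = 0$)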




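$\hat\theta_{1n} = \hat\theta_{0n} - \frac{l'(\hat\theta_{0n})}{l''(\hat\theta_{0n})}$ (1st approximation)

$\hat\theta_{2n} = \hat\theta_{1n} - \frac{l'(\hat\theta_{1n})}{l''(\hat\theta_{1n})}$ (2nd approximation)

and so on.

Stopping Rule: $|\hat\theta_{r+1,n} - \hat\theta_{r,n}| < \epsilon$ for some $r$; $\epsilon$: pre-assigned very small positive number (convergence criterion)

Decision: $\hat\theta_{r,n}$ or $\hat\theta_{r+1,n}$: solution of the likelihood equation (*), i.e. MLE of $\theta$.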
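
The recursion and stopping rule above can be sketched in R as follows. This is an illustrative Newton-Raphson implementation, not the lecture's own code (which uses Rao's scoring later in this module); the function names `score`, `hessian` and `newton`, the tolerance and the iteration cap are assumptions. The data are the 15 Cauchy observations given in the illustration later in this module, and the sample median is used as the starting value.

```r
x <- c(14.299, -0.165, -16.645, 2.331, -0.108, 122.423, 1.952, 0.372,
       -2.487, -0.925, -2.662, 0.275, -5.111, -0.461, -2.561)

# first and second derivatives of the Cauchy(theta, 1) loglikelihood
score   <- function(th) 2 * sum((x - th) / (1 + (x - th)^2))
hessian <- function(th) 2 * sum(((x - th)^2 - 1) / (1 + (x - th)^2)^2)

# Newton-Raphson: th_new = th - l'(th) / l''(th), stop when successive values agree
newton <- function(start, eps = 1e-8, max_iter = 100) {
  th <- start
  for (r in seq_len(max_iter)) {
    th_new <- th - score(th) / hessian(th)
    if (abs(th_new - th) < eps) return(list(root = th_new, iterations = r))
    th <- th_new
  }
  stop("no convergence; try another starting value")
}

fit <- newton(median(x))
fit$root
```

Newton-Raphson is sensitive to the starting value for the Cauchy likelihood, which can have several local maxima; this is one reason a good initial estimator matters.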




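Q. How to choose $\hat\theta_{0n}$ so that the iterative estimator is better?

Def. $T_n$ is called a $\sqrt{n}$-consistent estimator for $\theta$ if $\sqrt{n}(T_n - \theta)$ is bounded in probability; i.e.
$\forall\ \epsilon > 0$, $\exists\ M$ and $n_0$ such that $P(\sqrt{n}|T_n - \theta| > M) < \epsilon\ \forall\ n \geq n_0$.

In other words $T_n$ is called a $\sqrt{n}$-consistent estimator for $\theta$ if
$\sqrt{n}(T_n - \theta) = O_p(1)$.

Note. If $\sqrt{n}(T_n - \theta) \xrightarrow{D} Z$ then $T_n$ is $\sqrt{n}$-consistent for $\theta$.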




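Proof. For every $M > 0$ at which $F_Z$ is continuous,
$\lim_{n \to \infty} P\{\sqrt{n}|T_n - \theta| > M\} = P\{|Z| > M\} = 1 - P\{|Z| \leq M\}$

$= 1 - \{F_Z(M) - F_Z(-M)\}$, which $\downarrow$ in $M$, where $F_Z(\cdot)$ is the distribution function (DF) of $Z$.

Now $\lim_{M \to \infty} P\{|Z| > M\} = 0$, so given $\epsilon > 0$ we can choose $M$ with $P\{|Z| > M\} < \epsilon$

$\Rightarrow \exists\ n_0$ such that $P\{\sqrt{n}|T_n - \theta| > M\} < \epsilon$ for every $n \geq n_0$. Hence proved.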




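Note: $T_n$ is $\sqrt{n}$-consistent $\Rightarrow T_n$ is consistent.
Proof. Fix $\eta > 0$ and $\epsilon > 0$, and take $M$ and $n_0$ as in the definition. For $n \geq \max(n_0, M^2/\eta^2)$ we have $\sqrt{n}\,\eta \geq M$, so $|T_n - \theta| > \eta \Rightarrow \sqrt{n}|T_n - \theta| > M$.

Hence $P[|T_n - \theta| > \eta] \leq P[\sqrt{n}|T_n - \theta| > M] < \epsilon$
(as $T_n$ is $\sqrt{n}$-consistent for $\theta$)

$\Rightarrow P[|T_n - \theta| > \eta] \to 0$ for every $\eta > 0$,

i.e. $T_n$ is consistent for $\theta$. Proved.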




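Result. If $\hat\theta_{0n}$ is $\sqrt{n}$-consistent for $\theta$, the sequence of estimators

$\hat\theta_{1n} = \hat\theta_{0n} - \frac{l'_x(\hat\theta_{0n})}{l''_x(\hat\theta_{0n})}$

is asymptotically efficient under the assumptions of Theorem 3.

Proof: $\sqrt{n}(\hat\theta_{1n} - \theta_0) = \sqrt{n}(\hat\theta_{0n} - \theta_0) - \sqrt{n}\, \frac{l'_x(\hat\theta_{0n})}{l''_x(\hat\theta_{0n})}$.

By Taylor's expansion about $\theta_0$,
$l'_x(\hat\theta_{0n}) = l'_x(\theta_0) + (\hat\theta_{0n} - \theta_0)\, l''_x(\theta_0) + \frac{1}{2}(\hat\theta_{0n} - \theta_0)^2\, l'''_x(\theta^*)$, with $\theta^*$ between $\hat\theta_{0n}$ and $\theta_0$. Substituting,

$\sqrt{n}(\hat\theta_{1n} - \theta_0) = \frac{\frac{1}{\sqrt{n}}\, l'_x(\theta_0)}{-\frac{1}{n}\, l''_x(\hat\theta_{0n})} + \sqrt{n}(\hat\theta_{0n} - \theta_0)\, R_n$,

where $R_n = 1 - \frac{l''_x(\theta_0)}{l''_x(\hat\theta_{0n})} - \frac{1}{2}(\hat\theta_{0n} - \theta_0)\, \frac{l'''_x(\theta^*)}{l''_x(\hat\theta_{0n})}$.

Now $\frac{1}{\sqrt{n}}\, l'_x(\theta_0) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \frac{\partial}{\partial\theta} \log f_\theta(X_i)|_{\theta_0}$,
where $\frac{\partial}{\partial\theta} \log f_\theta(X_i)|_{\theta_0}$ is a sequence of iid random variables with

$E_{\theta_0}\left(\frac{\partial}{\partial\theta} \log f_\theta(X_i)\right) = 0$ and variance $= E_{\theta_0}\left(\frac{\partial}{\partial\theta} \log f_\theta(X_i)\right)^2 = i(\theta_0)$.

Hence by the Lindeberg-Levy CLT

$\frac{1}{\sqrt{n}}\, l'_x(\theta_0) \xrightarrow{D} N(0, i(\theta_0))$.

Again $\frac{1}{n}\, l''_x(\theta_0) = \frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta^2} \log f_\theta(X_i)|_{\theta_0}$,

where $\frac{\partial^2}{\partial\theta^2} \log f_\theta(X_i)|_{\theta_0}$ is a sequence of iid random variables with

$E_{\theta_0}\left(\frac{\partial^2}{\partial\theta^2} \log f_\theta(X_i)\right) = -i(\theta_0)$. Hence by Khinchin's WLLN

$\frac{1}{n}\, l''_x(\theta_0) \xrightarrow{P} -i(\theta_0)$.

By condition (vii), $\frac{1}{n}|l''_x(\hat\theta_{0n}) - l''_x(\theta_0)| \leq |\hat\theta_{0n} - \theta_0| \cdot \frac{1}{n}\sum M(X_i) \xrightarrow{P} 0$, so that $\frac{1}{n}\, l''_x(\hat\theta_{0n}) \xrightarrow{P} -i(\theta_0)$ as well, while $\frac{1}{n}|l'''_x(\theta^*)| \leq \frac{1}{n}\sum M(X_i)$ stays bounded in probability. Hence $R_n \xrightarrow{P} 0$.

Also, as $\hat\theta_{0n}$ is $\sqrt{n}$-consistent for $\theta_0$, $\sqrt{n}(\hat\theta_{0n} - \theta_0)$ is bounded in probability, so $\sqrt{n}(\hat\theta_{0n} - \theta_0)\, R_n \xrightarrow{P} 0$. Hence using Slutsky's Theorem finally we get

$\sqrt{n}(\hat\theta_{1n} - \theta_0) \xrightarrow{D} N\left(0, \frac{1}{i(\theta_0)}\right)$

$\Rightarrow \hat\theta_{1n}$ is asymptotically efficient for $\theta_0$ (as $\frac{1}{i(\theta_0)}$ is the CRLB based on a single random observation).

Proved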




Rao’s Scoring Method: 


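Result. Suppose $\hat\theta_{0n}$ is a $\sqrt{n}$-consistent estimator for $\theta$. Then

$\hat\theta_{1n} = \hat\theta_{0n} + \frac{l'_x(\hat\theta_{0n})}{n\, i(\hat\theta_{0n})}$ is asymptotically efficient provided $i(\theta)$ is a continuous function of $\theta$.

Proof. (Hint) $\frac{1}{n}\, l''_x(\hat\theta_{0n}) = \frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta^2} \log f_\theta(x_i)|_{\hat\theta_{0n}} \xrightarrow{P} -i(\theta_0)$ under $\theta_0$ (by Khinchin's WLLN).

Again $i(\hat\theta_{0n}) \xrightarrow{P} i(\theta_0)$ under $\theta_0$. Hence the proof is direct from the former Result.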




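Now the density of Cauchy$(\theta, 1)$ is: $f_\theta(x) = \frac{1}{\pi[1 + (x - \theta)^2]}$

$\Rightarrow \frac{\partial}{\partial\theta} \log f_\theta(x) = \frac{2(x - \theta)}{1 + (x - \theta)^2}$
and
$i(\theta) = E_\theta\left(\frac{\partial}{\partial\theta} \log f_\theta(X)\right)^2 = \frac{4}{\pi}\int_{-\infty}^{\infty} \frac{(x - \theta)^2}{(1 + (x - \theta)^2)^3}\, dx$

$= \frac{8}{\pi}\int_0^\infty \frac{y^2}{(1 + y^2)^3}\, dy$ $(y = x - \theta)$

$= \frac{4}{\pi}\int_0^\infty \frac{z^{3/2 - 1}}{(1 + z)^{3/2 + 3/2}}\, dz$ $(z = y^2)$

$= \frac{4}{\pi} B\left(\frac{3}{2}, \frac{3}{2}\right) = \frac{1}{2}$.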




Let $\tilde{X}_n$ = sample median.

Recall: If $Z_p$ is the sample $p$th quantile, then

$\sqrt{n}(Z_p - \zeta_p) \xrightarrow{D} N\left(0, \frac{p(1 - p)}{f^2(\zeta_p)}\right)$

as $n \to \infty$, where $f$: population density, $\zeta_p$: population $p$th quantile and $n$: sample size.




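Note that $\tilde{X}_n = Z_{1/2}$ and $\zeta_{1/2} = \theta$ for our Cauchy distribution.

$\Rightarrow p(1 - p) = \frac{1}{4}$ and $f_\theta(\zeta_{1/2}) = \frac{1}{\pi}$ and hence

$\sqrt{n}(\tilde{X}_n - \theta) \xrightarrow{D} Z \sim N\left(0, \frac{\pi^2}{4}\right)$

$\Rightarrow \tilde{X}_n$ is a $\sqrt{n}$-consistent estimator of $\theta$.

$\Rightarrow \tilde{X}_n$ can be taken as $\hat\theta_{0n}$.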




Now proceed as Rao's scoring method. 


Illustration: Consider the following 15 sample values from a Cauchy (0, 1) 
population 


14.299 -0.165 -16.645 2.331 -0.108 122.423 1.952 0.372 
-2.487 -0.925 -2.662 0.275 -5.111 -0.461 -2.561 


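Here $\hat\theta_{0n} = \tilde{x} = -0.165$ and, with $n = 15$ and $i(\theta) = \frac{1}{2}$,

$\hat\theta_{1n} = \hat\theta_{0n} + \frac{4}{15}\sum_{i=1}^{15} \frac{x_i - \hat\theta_{0n}}{1 + (x_i - \hat\theta_{0n})^2} = -0.2775$, correct up to 4 decimal places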




$\Rightarrow \hat\theta_{2n} \approx -0.2916$, $\hat\theta_{3n} \approx -0.2937$, $\hat\theta_{4n} \approx -0.2941$, $\hat\theta_{5n} \approx -0.2941$.

Here $\hat\theta_{4n}$ and $\hat\theta_{5n}$ agree up to 4 decimal places (from the iterative recursion).

$\Rightarrow$ the iteration has converged, correct up to 4 decimal places.

So the solution of the likelihood equation for $\theta$, correct up to 4 decimal places, is $-0.2941$.




Computation using R : 
(The solution of 0 obtained in the Cauchy distribution example was 
implemented in this way in R.) 


R Code and Output : 


samp=c(14.299,-0.165,-16.645,2.331,-0.108,122.423,1.952,0.372,-2.487,
       -0.925,-2.662,0.275,-5.111,-0.461,-2.561) # sample drawn
int_theta=median(samp) # root-n consistent initial estimator of theta
# one step of Rao's scoring: a + l'(a)/(n*i(a)), with i(a) = 1/2 and n = 15
calculate=function(a,samp)
{
  new_a=0
  sum=0
  for(i in 1:length(samp))
    sum=sum+(((samp[i]-a)/(1+(samp[i]-a)^2))*(2/15))
  new_a=a+(2*sum)
  return(new_a)
}




R Code and Output (continued) :

i=0
b=0
new_theta=int_theta
# iterate until two successive values agree to 4 decimal places
while(round(b,4)!=round(new_theta,4))
{
  i=i+1
  b=calculate(int_theta,samp)
  cat("Value of theta in iteration",i," is : ",round(b,4),"\n")
  if(b==int_theta)
    break
  else
  {
    new_theta=int_theta
    int_theta=b
  }
}




R Code and Output (continued) : 


Value of theta in iteration 1 is : -0.2775
Value of theta in iteration 2 is : -0.2916
Value of theta in iteration 3 is : -0.2937
Value of theta in iteration 4 is : -0.2941
Value of theta in iteration 5 is : -0.2941
> cat("Converged value = ",round(b,4),"\n")
Converged value = -0.2941

> cat("Number of iterations = ",i,"\n")
Number of iterations = 5




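Ex. (Logistic Distribution)

Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ drawn from a Logistic distribution with common density

$f_\theta(x) = \frac{\exp[-(x - \theta)]}{\{1 + \exp[-(x - \theta)]\}^2}$, $-\infty < x < \infty$, $-\infty < \theta < \infty$.

$\Rightarrow l_x(\theta) = -\sum_{i=1}^n x_i + n\theta - 2\sum_{i=1}^n \ln(1 + \exp[-(x_i - \theta)])$

$\Rightarrow l'_x(\theta) = n - 2\sum_{i=1}^n \frac{\exp[-(x_i - \theta)]}{1 + \exp[-(x_i - \theta)]}$

Here $l''_x(\theta) = -2\sum_{i=1}^n \frac{\exp[(x_i - \theta)]}{\{1 + \exp[(x_i - \theta)]\}^2} < 0$.
Also $l'_x(\theta) \to -n$ as $\theta \to \infty$

and $l'_x(\theta) \to n$ as $\theta \to -\infty$

$\Rightarrow \exists$ a unique $\hat\theta_{MLE}$ such that $l'_x(\hat\theta_{MLE}) = 0$,

i.e. $\frac{1}{n}\sum_{i=1}^n \frac{1}{1 + \exp[(x_i - \hat\theta_{MLE})]} = \frac{1}{2}$.

Now solve it by a successive approximation method and get the unique $\hat\theta_{MLE}$ (as for Cauchy$(\theta, 1)$).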
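
Since $l'_x(\theta)$ is strictly decreasing and changes sign, any one-dimensional root finder will locate the MLE. The sketch below uses base R's `uniroot` on a simulated sample; the simulated data (location 1, $n = 50$) and the bracketing interval `range(x)` are illustrative assumptions, not part of the lecture.

```r
set.seed(7)
x <- rlogis(50, location = 1, scale = 1)   # simulated sample, true theta = 1 (assumed)
n <- length(x)

# l'(theta) = n - 2 * sum 1 / (1 + exp(x_i - theta)); plogis(t) = 1 / (1 + exp(-t))
score <- function(th) n - 2 * sum(plogis(th - x))

# the score is positive at min(x) and negative at max(x), so range(x) brackets the root
fit <- uniroot(score, interval = range(x), tol = 1e-10)
theta_hat <- fit$root
theta_hat
```

Rao's scoring could be used instead; `uniroot` is chosen here only because the monotone score makes bracketing straightforward.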




Note : Under most of the truncated distributions, iterative method is the 
only option for finding ML estimates. 


Application: Let $X_1, \ldots, X_n$ be a random sample drawn from a Poisson$(\lambda)$ distribution where the mass point 0 is truncated. So the common pmf is

$f_\lambda(x) = \frac{e^{-\lambda}\lambda^x}{(1 - e^{-\lambda})\, x!}$, $x = 1, 2, \ldots$

The likelihood function of $\lambda$ is

$L(\lambda) = \frac{e^{-n\lambda}\, \lambda^{\sum x_i}}{(1 - e^{-\lambda})^n \prod_{i=1}^n x_i!}$

$\Rightarrow$ the loglikelihood of $\lambda$ is
$l(\lambda) = \text{Const} - n\lambda + \sum x_i \ln\lambda - n\ln(1 - e^{-\lambda})$

So the likelihood equation $\frac{\partial}{\partial\lambda} l(\lambda) = 0 \Rightarrow \bar{x} = \frac{\lambda}{1 - e^{-\lambda}}$.


Obviously no explicit solution of the above equation is possible for A. 


Hence we have to depend on the Rao's scoring method or any other 
numerical method of iteration for finding ML estimate of A. 
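
As one possible numerical route, the likelihood equation $\bar{x} = \lambda/(1 - e^{-\lambda})$ can be solved with `uniroot`. The sample below is hypothetical (made up for illustration), and the bracket $(10^{-8}, \bar{x})$ relies on the facts that $\lambda/(1 - e^{-\lambda}) \to 1 < \bar{x}$ as $\lambda \to 0$ and $\lambda/(1 - e^{-\lambda}) > \lambda$ for $\lambda > 0$.

```r
x <- c(1, 2, 1, 3, 2, 1, 4, 2, 1, 2)   # hypothetical zero-truncated Poisson counts
xbar <- mean(x)

# likelihood equation written as g(lambda) = lambda / (1 - exp(-lambda)) - xbar = 0
# (-expm1(-lam) is a numerically stable form of 1 - exp(-lam))
lik_eq <- function(lam) lam / (-expm1(-lam)) - xbar

fit <- uniroot(lik_eq, interval = c(1e-8, xbar), tol = 1e-12)
lambda_hat <- fit$root
lambda_hat
```

The estimate is always smaller than $\bar{x}$, as expected: truncating the zeros inflates the sample mean relative to $\lambda$.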




Multiparameter Case : 


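Suppose $\theta = (\theta_1, \ldots, \theta_k)'$.

Theorem 5. Under suitable assumptions, with probability tending to 1 as $n \to \infty$, $\exists$ a root $\hat\theta_n$ of the likelihood equations such that
(i) $\hat\theta_n \xrightarrow{P} \theta$
(ii) $\sqrt{n}(\hat\theta_n - \theta) \xrightarrow{D} N_k(0, I^{-1}(\theta))$; $I(\theta) = \left(\left(E\left(-\frac{\partial^2}{\partial\theta_i \partial\theta_j} \log f_\theta(X)\right)\right)\right)$.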




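Ex. Weibull density $f_{\alpha,\beta}(x) = \frac{\alpha}{\beta}\, x^{\alpha - 1} \exp(-x^\alpha/\beta)$; $\alpha, \beta, x > 0$.

$l(\theta) = n\log\alpha - n\log\beta + (\alpha - 1)\sum \log x_i - \frac{\sum x_i^\alpha}{\beta}$; based on a sample of size $n$.

$\frac{\partial}{\partial\beta} l(\theta) = 0 \Rightarrow -\frac{n}{\beta} + \frac{\sum x_i^\alpha}{\beta^2} = 0 \Rightarrow \beta = \frac{1}{n}\sum x_i^\alpha$

$\frac{\partial}{\partial\alpha} l(\theta) = 0 \Rightarrow \frac{n}{\alpha} + \sum \log x_i - \frac{\sum x_i^\alpha \log x_i}{\beta} = 0$

$\Rightarrow \frac{\sum x_i^\alpha \log x_i}{\sum x_i^\alpha} - \frac{1}{\alpha} = \frac{1}{n}\sum \log x_i$

Let $\phi(\alpha) = \frac{1}{n}\sum \log x_i + \frac{1}{\alpha} - \frac{\sum x_i^\alpha \log x_i}{\sum x_i^\alpha}$.

Now check that

• $\phi'(\alpha) < 0$; i.e. $\phi(\alpha)$ is strictly decreasing in $\alpha$. (You can use the Cauchy-Schwarz inequality to get it.)

• $\phi(\alpha) \to \infty$ as $\alpha \to 0^+$ and $\phi(\alpha) \to \frac{1}{n}\sum \log x_i - \log x_{(n)} < 0$ as $\alpha \to \infty$. (You can use $\frac{\sum x_i^\alpha \log x_i}{\sum x_i^\alpha} \leq \log x_{(n)}$ and $\frac{1}{n}\sum \log x_i \leq \log x_{(n)}$.)

Hence $\exists$ a unique $\alpha = \hat\alpha$ such that $\phi(\hat\alpha) = 0 \Leftrightarrow \frac{\partial}{\partial\alpha} l(\theta) = 0$, with $\hat\beta = \frac{1}{n}\sum x_i^{\hat\alpha}$. Also (using the Cauchy-Schwarz inequality or something else) check that the Hessian matrix

$H = \begin{pmatrix} \frac{\partial^2}{\partial\alpha^2} l(\theta) & \frac{\partial^2}{\partial\alpha\partial\beta} l(\theta) \\ \frac{\partial^2}{\partial\beta\partial\alpha} l(\theta) & \frac{\partial^2}{\partial\beta^2} l(\theta) \end{pmatrix}_{\hat\alpha, \hat\beta}$

is negative definite (as $|H| > 0$ and both the principal diagonal elements are negative). As a result $\hat\alpha$ and $\hat\beta$ are the ML estimates of $\alpha$ and $\beta$.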
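
Because $\phi$ is strictly decreasing with a sign change, the profile equation $\phi(\alpha) = 0$ can be solved by bracketing. The following sketch uses simulated data; the true values, the sample size and the interval $(0.01, 50)$ are assumptions chosen for illustration. Note that in this lecture's parametrisation $\beta$ corresponds to `scale^shape` in R's `rweibull`.

```r
set.seed(11)
x <- rweibull(200, shape = 2, scale = 3)   # simulated data; here alpha = 2, beta = 3^2 = 9

# phi(alpha) = mean(log x) + 1/alpha - sum(x^alpha * log x) / sum(x^alpha)
phi <- function(a) 1 / a + mean(log(x)) - sum(x^a * log(x)) / sum(x^a)

# phi is strictly decreasing, positive near 0 and negative for large alpha
alpha_hat <- uniroot(phi, interval = c(0.01, 50), tol = 1e-10)$root
beta_hat  <- mean(x^alpha_hat)             # beta_hat = (1/n) * sum x_i^alpha_hat

c(alpha = alpha_hat, beta = beta_hat)
```

For data with very large values a smaller upper bound for the interval may be needed to avoid overflow in `x^a`.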




Application. The following data show the maximum 24-hour precipitation levels (in inches) recorded for 36 storms in the period from 1900 to 1969 during the time they were over the mountains.

31.00 2.82 3.98 4.02 9.50 4.50 11.40 10.71 6.31 4.95 5.64
5.51 13.40 9.72 6.47 10.16 4.21 11.60 4.75 6.85 6.25 3.42
11.80 0.80 3.69 3.10 22.22 7.43 5.00 4.58 4.46 8.00 3.73
3.50 6.20 0.67


a. Draw the histogram. 
b. Identify the probability distribution. 


c. Estimate the parameters using Method of Moments and the Method 
of ML. 


d. Draw the fitted model. 




Figure: Histogram of the sample data provided (vertical axis: Density).


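Discussion. From the histogram the underlying probability distribution can be considered to be Gamma with the common p.d.f.

$f_{\alpha,\beta}(x) = \frac{1}{\Gamma(\alpha)\beta^\alpha} \exp\left(-\frac{x}{\beta}\right) x^{\alpha - 1}$; $x, \alpha, \beta > 0$

Then the likelihood function is

$L(\alpha, \beta) = \frac{1}{(\Gamma(\alpha))^n \beta^{n\alpha}} \exp\left(-\frac{\sum x_i}{\beta}\right) \prod_{i=1}^n x_i^{\alpha - 1}$

$\Rightarrow l(\alpha, \beta) = -n\alpha \ln\beta - n\ln\Gamma(\alpha) - \frac{\sum x_i}{\beta} + (\alpha - 1)\sum \ln x_i$

Then

$\frac{\partial}{\partial\beta} l(\alpha, \beta) = 0 \Rightarrow -\frac{n\alpha}{\beta} + \frac{\sum x_i}{\beta^2} = 0 \Rightarrow \beta = \frac{\bar{x}}{\alpha}$ ... (*)

Now $\frac{\partial}{\partial\alpha} l(\alpha, \beta) = 0$ along with (*) $\Leftrightarrow -n\frac{\Gamma'(\alpha)}{\Gamma(\alpha)} - n\ln\frac{\bar{x}}{\alpha} + \sum \ln x_i = 0$

where $\Gamma'(\alpha) = \int_0^\infty \exp(-x)\, x^{\alpha - 1} \log x\, dx$.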




• Moment Method of Estimation

$E(X_1) = \alpha\beta$ and $V(X_1) = \alpha\beta^2$. Hence using the Moment Method the moment equations are

$\alpha\beta = \bar{X}$ and $\alpha\beta^2 = S^2$

where $\bar{X}$ and $S^2$ are the sample mean and sample variance respectively.

On solving these equations the Moment Estimators of $\alpha$ and $\beta$ are $\frac{\bar{X}^2}{S^2}$ and $\frac{S^2}{\bar{X}}$ respectively.




• ML Method of Estimation

Using the Moment estimates as the initial approximation, one can solve the likelihood equations so obtained for $\alpha$ and $\beta$ in the iterative way. The solution will give the ML estimates of $\alpha$ and $\beta$.

Finally, replacing $\alpha$ and $\beta$ by their estimates (preferably ML estimates) in the density model of the Gamma distribution, one can get the fitted model.
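
For readers without an add-on package, the two gamma likelihood equations reduce to a single equation in $\alpha$, namely $\ln\alpha - \psi(\alpha) = \ln\bar{x} - \frac{1}{n}\sum \ln x_i$ with $\psi = \Gamma'/\Gamma$, followed by $\hat\beta = \bar{x}/\hat\alpha$. The sketch below solves it with base R only, using the 36 precipitation values; the bracketing interval and tolerance are assumptions, and the result may differ somewhat from the output of a packaged routine that uses an approximation.

```r
x <- c(31, 2.82, 3.98, 4.02, 9.50, 4.5, 11.4, 10.71, 6.31, 4.95, 5.64,
       5.51, 13.4, 9.72, 6.47, 10.16, 4.21, 11.6, 4.75, 6.85, 6.25, 3.42,
       11.8, 0.8, 3.69, 3.1, 22.22, 7.43, 5, 4.58, 4.46, 8, 3.73,
       3.5, 6.2, 0.67)

# right-hand side of the profile likelihood equation (positive by Jensen's inequality)
s <- log(mean(x)) - mean(log(x))

# log(alpha) - digamma(alpha) decreases from +Inf to 0, so the root is unique
alpha_eq  <- function(a) log(a) - digamma(a) - s
alpha_hat <- uniroot(alpha_eq, interval = c(1e-3, 1e3), tol = 1e-10)$root
beta_hat  <- mean(x) / alpha_hat           # from beta = xbar / alpha

c(shape = alpha_hat, scale = beta_hat)
```

The fitted density can then be overlaid on the histogram with, for example, `curve(dgamma(x, shape = alpha_hat, scale = beta_hat), add = TRUE)`.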




Computation using R : 
(The MLE and MME of a and f have been obtained in this way in R.) 


R Code and Output : 


library(rGammaGamma)
data=c(31,2.82,3.98,4.02,9.50,4.5,11.4,10.71,6.31,4.95,5.64,
5.51,13.4,9.72,6.47,10.16,4.21,11.6,4.75,6.85,6.25,3.42,11.8,
0.8,3.69,3.1,22.22,7.43,5,4.58,4.46,8,3.73,3.5,6.2,0.67)
hist(data,freq=F,main="Histogram of sample data")
X_bar=mean(data)
S_2=var(data)
alphaMME=(X_bar^2)/S_2
betaMME=S_2/X_bar
> alphaMME # MME of shape
[1] 1.589584
> betaMME # MME of scale
[1] 4.584532
> gammaMLE(data) # MLE of shape and scale
   shape    scale
2.323328 3.143002




LIKELIHOOD RATIO TEST (LRT): Basic Ideas 
Module 9 


Saurav De 


Department of Statistics 
Presidency University 




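Let $X_1, X_2, \ldots, X_n$ be iid with common p.m.f. or p.d.f. $f_\theta(x)$.
$L_x(\theta)$ = Likelihood function of $\theta$.

Consider the problem of testing $H : \theta \in \Omega_H$ against $K : \theta \in \Omega_K\ (\subset \Omega - \Omega_H)$; where $\Omega$: parameter space and $\Omega_H$: parameter space under $H$.

The likelihood ratio (LR) criterion is defined as $\Lambda(x) = \frac{\sup_{\theta \in \Omega_H} L_x(\theta)}{\sup_{\theta \in \Omega_H \cup \Omega_K} L_x(\theta)}$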




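Note: $0 \leq \sup_{\theta \in \Omega_H} L_x(\theta) \leq \sup_{\theta \in \Omega_H \cup \Omega_K} L_x(\theta)$,
i.e. $0 \leq \Lambda(x) \leq 1$.

[An alternative form of the LR criterion: $\Lambda_1(x) = \frac{\sup_{\theta \in \Omega_H} L_x(\theta)}{\sup_{\theta \in \Omega_K} L_x(\theta)}$.

Naturally here $0 \leq \Lambda_1(x) < \infty$. This form is seldom used, maybe due to the fact that here the LR criterion is unbounded above.]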




A Discussion:
1. $H$ is true $\Rightarrow$ the numerator and denominator of $\Lambda(x)$ will coincide.

So $\Lambda(x) = 1$ should imply that $H$ is true trivially.

Even a high value (close to 1) of $\Lambda(x)$: evidence in favour of acceptance of $H$.




On the other hand 


2. $H$ is not true $\Rightarrow$ the denominator of $\Lambda(x)$ will give the supremum value of $L_x(\theta)$, because in that case the most likely value of $\theta$, if it exists, will lie within $\Omega_K$ and hence within $\Omega_H \cup \Omega_K$, but not within $\Omega_H$.

$\Rightarrow$ the numerator will be significantly less compared to the denominator.

$\Rightarrow \Lambda(x)$ will be significantly low (close to 0).




This discussion can easily motivate us to frame the critical region as 
follows. 


Critical region: $\Lambda(x) < c$, where $c$ is such that the size of the test is $\alpha$.
[or $\Lambda_1(x) < c_1$, if the LR criterion is $\Lambda_1(x)$]
Note. If the distribution of $\Lambda(x)$ is discrete, a randomised test may be used.


Note. LRT entertains any kind of null and alternative hypotheses; simple 
as well as composite. 




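Note. Under simple null versus simple alternative hypothesis, LRT $\Leftrightarrow$ Most Powerful Test using the NP Lemma.

Proof. Let $H : \theta = \theta_0$ (known) versus $K : \theta = \theta_1$ (known) [i.e. simple null versus simple alternative].

Let $X_1, \ldots, X_n \sim$ pmf or pdf $f_\theta(x)$, and write $f_\theta(x)$ also for the joint pmf or pdf of the sample

$\Rightarrow \sup_{\theta = \theta_0} L(\theta) = f_{\theta_0}(x)$ and $\sup_{\theta = \theta_1} L(\theta) = f_{\theta_1}(x)$

The LRT $\Rightarrow$

$\Lambda(x) = \frac{\sup_{\theta = \theta_0} L(\theta)}{\sup_{\theta \in \{\theta_0, \theta_1\}} L(\theta)} = \frac{f_{\theta_0}(x)}{\max\{f_{\theta_0}(x), f_{\theta_1}(x)\}} < c$ (with $c < 1$) $\Leftrightarrow \frac{f_{\theta_1}(x)}{f_{\theta_0}(x)} > \frac{1}{c}$

$\Rightarrow$ the MP Test from the NP Lemma.
Proved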




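Note. If $\exists$ a sufficient statistic $T(X)$ based on $X$, then $\Lambda(x)$ is a function of $T(X)$.

Proof. Using the Neyman-Fisher Factorisation Theorem, we can write

$\Lambda(x) = \frac{\sup_{\theta \in \Omega_H} L_x(\theta)}{\sup_{\theta \in \Omega_H \cup \Omega_K} L_x(\theta)} = \frac{h(x)\sup_{\theta \in \Omega_H} g_\theta(T(x))}{h(x)\sup_{\theta \in \Omega_H \cup \Omega_K} g_\theta(T(x))} = \frac{\sup_{\theta \in \Omega_H} g_\theta(T(x))}{\sup_{\theta \in \Omega_H \cup \Omega_K} g_\theta(T(x))}$,

a function of $T(x)$. Hence proved.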




Ex. Let $X_1, X_2, \ldots, X_n \sim N(\mu, \sigma^2)$ independently.

To test $H : \mu = 0$ versus (a) $K_1 : \mu \neq 0$ and (b) $K_2 : \mu > 0$.

$L_x(\theta) = (2\pi)^{-n/2}(\sigma^2)^{-n/2} \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\right]$

(a) $\Omega_H = \{(\mu, \sigma) : \mu = 0, \sigma > 0\}$; $\Omega_H \cup \Omega_K = \{(\mu, \sigma) : \mu \in \mathbb{R}, \sigma > 0\}$

Under $\Omega_H \cup \Omega_K$: $\hat\mu = \bar{x}$; $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2$

Under $\Omega_H$: $\hat\sigma_H^2 = \frac{1}{n}\sum_{i=1}^n x_i^2$

LR criterion

$\Lambda(x) = \left[\frac{\sum (x_i - \bar{x})^2}{\sum x_i^2}\right]^{n/2} = \left[1 + \frac{n\bar{x}^2}{\sum (x_i - \bar{x})^2}\right]^{-n/2}$

Critical region:

$\left[1 + \frac{n\bar{x}^2}{\sum (x_i - \bar{x})^2}\right]^{-n/2} < c$

$\Leftrightarrow \left|\frac{\sqrt{n}\,\bar{x}}{s}\right| > t_{\alpha/2;\, n-1}$, where $s^2 = \frac{1}{n-1}\sum (x_i - \bar{x})^2$

This is the UMPU test with test statistic $t = \frac{\sqrt{n}\,\bar{x}}{s}$ which, under $H$, follows the t-distribution with $n - 1$ degrees of freedom.
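
The equivalence between the LR criterion and the t statistic rests on the identity $\frac{n\bar{x}^2}{\sum (x_i - \bar{x})^2} = \frac{t^2}{n - 1}$. A small numerical check, on simulated data with assumed values of $\mu$, $\sigma$, $n$ and $\alpha$, is sketched below.

```r
set.seed(3)
x <- rnorm(12, mean = 0.4, sd = 1.5)   # simulated sample (assumed values)
n <- length(x)

# LR criterion computed directly from its definition
lambda <- (sum((x - mean(x))^2) / sum(x^2))^(n / 2)

# the same quantity expressed through the one-sample t statistic
t_stat        <- sqrt(n) * mean(x) / sd(x)
lambda_from_t <- (1 + t_stat^2 / (n - 1))^(-n / 2)

# two-sided size-0.05 test: small lambda is equivalent to large |t|
reject <- abs(t_stat) > qt(1 - 0.05 / 2, df = n - 1)

c(lambda = lambda, lambda_from_t = lambda_from_t, t = t_stat, reject = reject)
```

Since $\Lambda(x)$ is a decreasing function of $|t|$, rejecting for small $\Lambda(x)$ is the same as rejecting for large $|t|$.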


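(b) $\Omega_H \cup \Omega_K = \{(\mu, \sigma) : \mu \geq 0, \sigma > 0\}$

$\hat\mu = \bar{x}$ if $\bar{x} \geq 0$
$\quad = 0$ if $\bar{x} < 0$

So $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \hat\mu)^2$

$\Lambda(x) = 1$, $\bar{x} \leq 0$ (case of trivial acceptance of $H$)
$\quad = \left[\frac{\sum (x_i - \bar{x})^2}{\sum (x_i - \bar{x})^2 + n\bar{x}^2}\right]^{n/2}$, $\bar{x} > 0$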




Ex. Let X1, X2,...,Xn be n independent bernoulli(p) variables. Suppose 
we are to test 


H : p € po versus K : p> po. 
Here Quy = (p:0 € p € po} and QHUQK ={p:0<p< Il}. 


Under Qj, U Ox, the ML estimate of p is the sample mean x and hence 
the supremum of the likelihood is 


sup L(p)- (X) (1 — x)", 
QuUO« 


Saurav De (Department of Statistics PresideLIKELIHOOD RATIO TEST (LRT): Basic Id 14 / 28 


LIKELIHOOD RATIO TEST (LRT):  @@atuahale 


Basic Ideas 


Under Qy i.e. under p < po, MLE of p is 


^ 


D = X,X< Po 
= po, X> po (Restricted MLE of p) 


Hence sup L(p) = (x)™(1—x)"-*) , x < po 


Qy 


"x(1— x)= x) ^ X < po 


Saurav De (Department of Statistics PresideLIKELIHOOD RATIO TEST (LRT): Basic ld 15 / 28 


LIKELIHOOD RATIO TEST (LRT): ser 
Basic Ideas 


So A(x) 2 1 for x € po but A(x) = aei t <1 for x > po. 
A(x) | X 


Thus the LRT critical region A(x) < c 4 X > ci or equivalently 
as c5», where c» is 2 


SNP p [Yx > c] X a ie. Supe p | X > c] <a 


, & being the given level of significance. 


Saurav De (Department of Statistics PresideLIKELIHOOD RATIO TEST (LRT): Basic Id 16 / 28 


In this case $\sum X_i \sim \text{Bin}(n, p)$.

Now we know $P_p\left[\sum X_i > k\right] = I_p(k + 1, n - k)$, where

\[ I_p(k + 1, n - k) = \frac{1}{B(k + 1, n - k)} \int_0^p u^{k} (1 - u)^{n - k - 1}\, du, \]

$B(k + 1, n - k)$ being the Beta integral and $I_p(k + 1, n - k)$ the incomplete Beta function.

As $I_p(k + 1, n - k) \uparrow p$ (evident from the definition of $I_p(k + 1, n - k)$),

\[ \sup_{p \le p_0} P_p\left[\sum X_i > c_2\right] = P_{p_0}\left[\sum X_i > c_2\right] \le \alpha. \]


Thus the LRT for testing H : $p \le p_0$ against K : $p > p_0$ is:

Reject H if $\sum x_i > c_2$, where $c_2$ is the smallest integer such that

\[ P_{p_0}\left[\sum X_i > c_2\right] \le \alpha. \]

Note: In this case the LR test coincides with the UMP test.
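The cut-off rule above can be sketched numerically. The following is a minimal illustration in Python (standard library only), not part of the original slides; the function names `binom_upper_tail` and `lrt_cutoff` and the values n = 20, p0 = 0.5, alpha = 0.05 are assumptions chosen for the example.

```python
from math import comb

def binom_upper_tail(n, p, k):
    """P[S > k] for S ~ Bin(n, p), computed by direct summation."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1, n + 1))

def lrt_cutoff(n, p0, alpha):
    """Smallest integer c2 with P_{p0}[S > c2] <= alpha (hypothetical helper)."""
    for c2 in range(n + 1):
        if binom_upper_tail(n, p0, c2) <= alpha:
            return c2
    return n  # P[S > n] = 0, so this is always admissible

# Illustrative values (assumed, not from the slides)
n, p0, alpha = 20, 0.5, 0.05
c2 = lrt_cutoff(n, p0, alpha)
attained_size = binom_upper_tail(n, p0, c2)
print("reject H if sum(x) >", c2, "; attained size =", attained_size)
```

Because the statistic is discrete, the attained size is generally below the nominal level; a randomised test would be needed to reach the level exactly.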


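Ex. Let $X_1, X_2, \ldots, X_n \sim R(0, \theta)$ independently. Then

\[ L_x(\theta) = \begin{cases} \dfrac{1}{\theta^n} & \text{if } x_{(n)} \le \theta \\ 0 & \text{otherwise} \end{cases} \]

Let H : $\theta = \theta_0$ versus K : $\theta \ne \theta_0$.

Under $\Omega_H \cup \Omega_K$, $\hat{\theta} = X_{(n)}$. Hence the LR criterion is

\[ \lambda(x) = \frac{\sup_{\theta \in \Omega_H} L_x(\theta)}{\sup_{\theta \in \Omega_H \cup \Omega_K} L_x(\theta)} = \begin{cases} \left(\dfrac{x_{(n)}}{\theta_0}\right)^{n} & \text{if } x_{(n)} \le \theta_0 \\ 0 & \text{if } x_{(n)} > \theta_0 \quad \text{(case of trivial rejection of H)} \end{cases} \]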


Now the LR test: $0 \le \lambda(x) < c$, where c is determined from the size-$\alpha$ condition

$\iff$ LR test: $x_{(n)} < d$ or $x_{(n)} > \theta_0$, where d is determined from the size-$\alpha$ condition

$\implies$ LR test: $x_{(n)} < \theta_0\, \alpha^{1/n}$ or $x_{(n)} > \theta_0$.

This is a UMP size-$\alpha$ test.
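As a quick check of the cut-off $d = \theta_0 \alpha^{1/n}$, the sketch below (Python, standard library only) uses the fact that under H, $P[X_{(n)} < d] = (d/\theta_0)^n$ for $d \le \theta_0$ while $P[X_{(n)} > \theta_0] = 0$. The helper names and the numerical values are assumptions made for illustration.

```python
def uniform_lrt_cutoff(theta0, n, alpha):
    """Lower cut-off d = theta0 * alpha**(1/n) of the LR test for R(0, theta)."""
    return theta0 * alpha ** (1.0 / n)

def size_under_h(theta0, n, d):
    """P[X_(n) < d or X_(n) > theta0] under theta = theta0, for d <= theta0.

    The second event has probability zero under H, so only the first term remains.
    """
    return (d / theta0) ** n

# Illustrative values (assumed)
theta0, n, alpha = 2.0, 10, 0.05
d = uniform_lrt_cutoff(theta0, n, alpha)
size = size_under_h(theta0, n, d)
print("d =", d, "size =", size)
```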


Ex. A random sample of size n is taken from the p.m.f.

$P(X = x_j) = p_j$, $j = 1, 2, 3, 4$, $0 < p_j < 1$. Find the form of the LR test of $H_0 : p_1 = p_2 = p_3 = p_4 = \frac{1}{4}$ against

$H_1 : p_1 = p_2 = p/2$, $p_3 = p_4 = (1 - p)/2$, $0 < p < 1$.

Let $n_j$ = number of times the value $x_j$ appears in the sample of size n (fixed). Obviously $\sum_{j=1}^{4} n_j = n$.

Also $\mathbf{n} = (n_1, n_2, n_3, n_4)' \sim MN(n; p_1, p_2, p_3, p_4)$, $\sum_{j=1}^{4} p_j = 1$.


Then under $H_0$ the likelihood function is

\[ L_{H_0}(p \mid \mathbf{n}) = C(\mathbf{n}) \left(\tfrac{1}{4}\right)^{n}, \text{ a constant, where } C(\mathbf{n}) = \frac{n!}{n_1!\, n_2!\, n_3!\, n_4!}. \]

Similarly $L_{H_1}(p \mid \mathbf{n}) = C(\mathbf{n}) \left(\tfrac{1}{2}\right)^{n} p^{t} (1 - p)^{n - t}$, where $t = n_1 + n_2$.

Now it is not difficult to see that the maximum of $L_{H_1}(p \mid \mathbf{n})$ is

\[ \frac{C(\mathbf{n})}{(2n)^{n}}\, t^{t} (n - t)^{n - t}, \text{ which is attained at } p = \frac{t}{n}. \]


Now the LR:

\[ \lambda(\mathbf{n}) = \frac{C(\mathbf{n}) \left(\frac{1}{4}\right)^{n}}{\frac{C(\mathbf{n})}{(2n)^{n}}\, t^{t} (n - t)^{n - t}} = \frac{(n/2)^{n}}{t^{t} (n - t)^{n - t}} \]

$\implies$ the critical region based on the LR criterion is $\{\lambda(\mathbf{n}) < K_1\}$

$\iff \left\{ \dfrac{1}{t^{t} (n - t)^{n - t}} < K_1' \right\}$

$\iff \{\psi(t) > K_2\}$

where $\psi(t) = t \ln t + (n - t) \ln(n - t)$, and $K_1$, $K_1'$ and $K_2$ are suitable constants.


Now check that $\psi(t)$ is minimum at $t = \frac{n}{2}$ and $\psi\left(\frac{n}{2} - t\right) = \psi\left(\frac{n}{2} + t\right)\ \forall\, t$.
This is evident also from the following graph:

[Figure: scatterplot of psi(t) versus t]


So $\{\psi(t) > K_2\} \iff \{t < K \text{ or } t > n - K\}$, where the constant K is such that

\[ P_{H_0}[T < K] \le \alpha/2 \quad \text{but} \quad P_{H_0}[T < K + 1] > \alpha/2, \]

with $T = n_1 + n_2 \sim \text{Bin}(n, 1/2)$ under $H_0$ (from the marginal distribution of the multinomial distribution).
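The choice of K above can be illustrated with a short computation. This is a sketch in Python (standard library only) under assumed values n = 20 and alpha = 0.05; the names `psi`, `lower_tail_strict` and `symmetric_cutoff` are hypothetical helpers, not taken from the slides.

```python
from math import comb, log

def psi(t, n):
    """psi(t) = t ln t + (n - t) ln(n - t), with the convention 0 ln 0 = 0."""
    def xlogx(v):
        return 0.0 if v == 0 else v * log(v)
    return xlogx(t) + xlogx(n - t)

def lower_tail_strict(n, k):
    """P[T < k] for T ~ Bin(n, 1/2)."""
    return sum(comb(n, j) for j in range(0, k)) / 2**n

def symmetric_cutoff(n, alpha):
    """Largest K with P[T < K] <= alpha/2, i.e. P[T < K + 1] > alpha/2."""
    K = 0
    while lower_tail_strict(n, K + 1) <= alpha / 2:
        K += 1
    return K

# Illustrative values (assumed)
n, alpha = 20, 0.05
K = symmetric_cutoff(n, alpha)
size = 2 * lower_tail_strict(n, K)   # P[T < K] + P[T > n - K] by symmetry
print("reject H0 if t <", K, "or t >", n - K, "; size =", size)
```

The symmetry $\psi(n/2 - t) = \psi(n/2 + t)$ is what makes the two cut-offs K and n - K mirror images of each other.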


TRY YOURSELF!

M9.1. Based on a random sample of size n from a Poisson($\lambda$) distribution, give an LRT for testing (a) H : $\lambda = 2$ versus $K_1$ : $\lambda \ne 2$ and (b) H : $\lambda \ge 2$ versus $K_2$ : $\lambda < 2$.

M9.2. A die is tossed 60 times in order to test
H : $P\{j\} = 1/6$, $j = 1, 2, \ldots, 6$ (i.e. the die is fair) against
K : $P\{2j - 1\} = 1/9$, $P\{2j\} = 2/9$, $j = 1, 2, 3$.
Provide the LR test.


TUTORIAL DISCUSSION:

Overview of the problems from MODULE 9

M9.1. In part (a), $\Omega_H = \{\lambda = 2\}$, a singleton set, and $\Omega_H \cup \Omega_{K_1} = \{\lambda : 0 < \lambda < \infty\}$, the unrestricted parameter space of $\lambda$, over which the sample mean $\bar{x}$ is the MLE.

In part (b), under $\Omega_H = \{\lambda \ge 2\}$, a restricted parameter space of $\lambda$, the MLE of $\lambda$ is

\[ \hat{\lambda}_H = \begin{cases} \bar{x}, & \bar{x} \ge 2 \\ 2, & \bar{x} < 2 \end{cases} \]

Now proceed straight away as discussed in the worked-out examples.

M9.2. The solution is very similar to, but comparatively a little easier than, the solution of the last worked-out example in this module.


Some Further Properties of LRT 
Module 10 


Saurav De 


Department of Statistics 
Presidency University 


LRT may be worthless:

Suppose the pmf of X with parameter $\theta$ is as follows:

\[ P_{\theta = 0}(x) = \begin{cases} \alpha, & x = 0 \\ \frac{1}{2} - \alpha, & x = \pm 1 \\ \frac{\alpha}{2}, & x = \pm 2 \\ 0 & \text{otherwise} \end{cases} \]


For $0 < \theta < 1$, the pmf is

\[ P_{\theta}(x) = \begin{cases} \dfrac{1 - c}{1 - \alpha}\, \alpha & \text{if } x = 0 \\ \dfrac{1 - c}{1 - \alpha} \left(\dfrac{1}{2} - \alpha\right) & \text{if } x = \pm 1 \\ c\,\theta & \text{if } x = 2 \\ c\,(1 - \theta) & \text{if } x = -2 \\ 0 & \text{otherwise} \end{cases} \]

where $0 < \alpha < 1/2$ and $\alpha/(2 - \alpha) < c < \alpha$.

To test H : $\theta = 0$ against K : $\theta > 0$.

The LR criterion is

\[ \lambda(x) = \frac{P_{\theta = 0}(x)}{\sup_{0 \le \theta < 1} P_{\theta}(x)}. \]


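Now

\[ \lambda(0) = \frac{\alpha}{\max\left\{\alpha,\ \frac{1 - c}{1 - \alpha}\,\alpha\right\}}. \]

Given $c < \alpha \implies 1 - c > 1 - \alpha$, i.e. $\frac{1 - c}{1 - \alpha} > 1 \implies \left(\frac{1 - c}{1 - \alpha}\right)\alpha > \alpha$.

As a result $\lambda(0) = \dfrac{\alpha}{\left(\frac{1 - c}{1 - \alpha}\right)\alpha} = \dfrac{1 - \alpha}{1 - c}$.

Similarly $\lambda(1) = \dfrac{1 - \alpha}{1 - c} = \lambda(-1)$.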


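In the same way

\[ \lambda(2) = \frac{\alpha/2}{\max\left\{\frac{\alpha}{2},\ \sup_{0 < \theta < 1} c\,\theta\right\}} = \frac{\alpha/2}{\max\left\{\frac{\alpha}{2},\ c\right\}}. \]

Now $\dfrac{\alpha}{2 - \alpha} < c \implies \dfrac{2 - \alpha}{\alpha} > \dfrac{1}{c} \implies \dfrac{2}{\alpha} > 1 + \dfrac{1}{c} > \dfrac{1}{c} \implies \dfrac{\alpha}{2} < c$

$\implies \lambda(2) = \dfrac{\alpha/2}{c} = \dfrac{\alpha}{2c}$. Similarly we get $\lambda(-2) = \dfrac{\alpha}{2c}$.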


Moreover $\dfrac{\alpha}{2 - \alpha} < c \implies \alpha(1 - c) < 2c(1 - \alpha)$, i.e.

\[ \frac{\alpha}{2c} < \frac{1 - \alpha}{1 - c}, \]

hence a non-trivial LR test $\lambda(x) < d$ is the same as $\lambda(x) \le \dfrac{\alpha}{2c}$.

Reason: $\lambda(x)$ takes only two positive finite values, $\dfrac{1 - \alpha}{1 - c}$ and $\dfrac{\alpha}{2c}$, and of these two the smaller is $\dfrac{\alpha}{2c}$.


But $\lambda(x) \le \dfrac{\alpha}{2c} \iff \lambda(x) = \dfrac{\alpha}{2c} \iff x = \pm 2$

$\implies$ the critical region of the LRT is $\{x = \pm 2\}$.

Size of the test $= P_{\theta = 0}[X = \pm 2] = 2 \times \dfrac{\alpha}{2} = \alpha$.

Power function is $P_{\theta}[X = \pm 2] = c\,\theta + c\,(1 - \theta) = c < \alpha\ \ \forall\, \theta > 0$

$\implies$ the LR test is biased.


Define a test function $\phi^*(x) = \alpha$ (i.e. constant) $\forall\, x$: a trivial size-$\alpha$ test.
The power function is $E_{\theta}\, \phi^*(X) = \alpha\ \ \forall\, \theta > 0$

$\implies \phi^*$ is unbiased and is uniformly more powerful than the above LRT, though $\phi^*$ is a trivial test

$\implies$ the above LRT is practically worthless.
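The computations in this counterexample can be verified numerically. The sketch below (Python, standard library only) tabulates $\lambda(x)$ on the five support points and compares size and power; the function name `lr_values` and the values alpha = 0.1, c = 0.08 (which satisfy $\alpha/(2-\alpha) < c < \alpha$) are assumptions made for illustration.

```python
def lr_values(alpha, c):
    """lambda(x) = P_0(x) / max(P_0(x), sup over 0 < theta < 1 of P_theta(x))."""
    null = {0: alpha, 1: 0.5 - alpha, -1: 0.5 - alpha, 2: alpha / 2, -2: alpha / 2}
    k = (1 - c) / (1 - alpha)
    # Suprema of the alternative pmf over theta in (0, 1); for x = 2 and x = -2
    # the supremum of c*theta and c*(1 - theta) equals c (approached, not attained).
    sup_alt = {0: k * alpha, 1: k * (0.5 - alpha), -1: k * (0.5 - alpha), 2: c, -2: c}
    return {x: null[x] / max(null[x], sup_alt[x]) for x in null}

# Illustrative values (assumed)
alpha, c = 0.1, 0.08
assert alpha / (2 - alpha) < c < alpha

lam = lr_values(alpha, c)
smallest = min(lam.values())
critical_region = sorted(x for x, v in lam.items() if abs(v - smallest) < 1e-12)
size = alpha / 2 + alpha / 2     # P_0[X = 2] + P_0[X = -2]
power = c                        # c*theta + c*(1 - theta) for every theta in (0, 1)
print(critical_region, size, power)
```

With these values the power (0.08) is below the size (0.1), which is the bias shown in the slides.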


Large sample properties of an LRT:
Consider the following theorems.

Theorem (single-parameter case): Consider the problem of testing H : $\theta = \theta_0$ against K : $\theta \ne \theta_0$. Under the assumptions of Theorem 3,

\[ -2 \log \lambda(x) \xrightarrow{D} \chi^2_1, \text{ under H.} \]


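Proof. By definition

\[ \lambda(x) = \frac{L(\theta_0)}{L(\hat{\theta}_n)}, \quad \hat{\theta}_n \text{ denoting the MLE of } \theta, \]

or $\log \lambda(x) = l(\theta_0) - l(\hat{\theta}_n)$ ($l(\theta)$: the log-likelihood function of $\theta$).

Expanding $l(\theta_0)$ about $\hat{\theta}_n$ by Taylor's expansion $\implies$

\[ l(\theta_0) = l(\hat{\theta}_n) + (\theta_0 - \hat{\theta}_n)\, l'(\hat{\theta}_n) + \frac{(\theta_0 - \hat{\theta}_n)^2}{2}\, l''(\theta_n^*), \]

with $\theta_n^*$ lying between $\theta_0$ and $\hat{\theta}_n$, so that

\[ l(\hat{\theta}_n) - l(\theta_0) \approx \frac{(\hat{\theta}_n - \theta_0)^2}{2} \left(-l''(\hat{\theta}_n)\right) \]

(as $l'(\hat{\theta}_n)$ vanishes because $\hat{\theta}_n$ is the MLE).

Moreover $-l''(\hat{\theta}_n) = -\sum_{i=1}^{n} \frac{\partial^2}{\partial \theta^2} \log f_{\theta}(x_i) \Big|_{\theta = \hat{\theta}_n}$ (also known as the observed Fisher information) is asymptotically equal to $n\, i(\hat{\theta}_n)$, where $i(\theta)$ is the Fisher information per observation.

Thus $-2 \log \lambda(x) = 2\left[l(\hat{\theta}_n) - l(\theta_0)\right] \approx \left(\sqrt{n}\,(\hat{\theta}_n - \theta_0)\right)^2 i(\hat{\theta}_n)$.

Asymptotic distribution of the MLE $\implies$

\[ \sqrt{n}\,(\hat{\theta}_n - \theta_0) \xrightarrow{D} N\left(0, \frac{1}{i(\theta_0)}\right) \text{ under H.} \]

Also $i(\hat{\theta}_n) \xrightarrow{P} i(\theta_0)$ under H. So

\[ \sqrt{n}\,(\hat{\theta}_n - \theta_0) \sqrt{i(\hat{\theta}_n)} \xrightarrow{D} N(0, 1) \text{ under H,} \]

and its square converges in distribution to $\chi^2_1$. Hence the theorem.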
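To see how well the $\chi^2_1$ approximation works in a concrete case, the sketch below computes the exact size of the test "reject H if $-2\log\lambda > 3.841$" for Bernoulli data, where 3.841 is the upper 5% point of $\chi^2_1$. It is an illustration in Python (standard library only); the Bernoulli model, the helper names and the values n = 200, p0 = 0.5 are assumptions chosen for the example, not taken from the slides.

```python
from math import comb, log

def neg2_log_lambda(k, n, p0):
    """-2 log lambda for H: p = p0 vs K: p != p0, given k successes in n trials."""
    def xlogy(a, b):
        # Convention 0 * log(0) = 0, so k = 0 and k = n are handled safely.
        return 0.0 if a == 0 else a * log(b)
    phat = k / n
    return 2 * (xlogy(k, phat / p0) + xlogy(n - k, (1 - phat) / (1 - p0)))

def exact_size(n, p0, crit=3.841):
    """Exact P_{p0}[-2 log lambda > crit], summing the Bin(n, p0) pmf."""
    return sum(
        comb(n, k) * p0**k * (1 - p0)**(n - k)
        for k in range(n + 1)
        if neg2_log_lambda(k, n, p0) > crit
    )

# Illustrative values (assumed)
size_200 = exact_size(200, 0.5)
print("exact size at nominal 5% level:", size_200)
```

The exact size differs slightly from 0.05 because the statistic is discrete; the theorem only guarantees agreement in the limit.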


Theorem (multiparameter case):
Suppose we are to test H : $\theta_{s \times 1} = \theta_0$ against K : $\theta_{s \times 1} \ne \theta_0$.
Under the assumptions of Theorem 4,

\[ -2 \log \lambda(x) \xrightarrow{D} \chi^2_s, \text{ under H.} \]

Proceed just like the earlier proof, using the asymptotic multivariate normality of the vector-valued MLE $\hat{\theta}_n$ of the s-component parameter vector $\theta$.


Note on asymptotic distribution:

To test H : $\theta_i = g_i(\phi_1, \ldots, \phi_r)$; $i = 1, \ldots, s$, $r < s$; $\phi_1, \ldots, \phi_r$ being unknown,

\[ -2 \log \lambda(x) \xrightarrow{D} \chi^2_{s - r}, \text{ under H,} \]

where $s - r$ = number of parameters specified by H.

The above result has several applications in testing of hypotheses.


Application 1: Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ independently.
To test H : $\mu = 0$ versus K : $\mu \ne 0$.

Here the parameters are $(\mu, \sigma^2)$, i.e. $s = 2$.
$\sigma$ remains unknown throughout, i.e. $r = 1$.
So here

\[ -2 \log \lambda(x) \xrightarrow{D} \chi^2_1, \text{ under H.} \]


Application 2: Let $X_1, \ldots, X_n \sim N_p(\mu, \Sigma)$ independently.
To test H : $\Sigma = \sigma^2 I_p$ against K : $\Sigma \ne \sigma^2 I_p$ (known as the sphericity test).

Here the number of parameters in general $= s = p + \frac{p(p+1)}{2}$ and the number of unknown parameters under H $= r = p + 1$

$\implies$ number of parameters specified by H $= s - r = \frac{p(p+1)}{2} - 1$

\[ \implies -2 \log \lambda(x) \xrightarrow{D} \chi^2_{\frac{p(p+1)}{2} - 1}, \text{ under H.} \]


Consistency of LRT:

Consider the problem of testing H : $\theta \in \Omega_H$ against K : $\theta \in \Omega_K\ (\subset \Omega - \Omega_H)$.

Naturally the LR criterion will be

\[ \lambda(x) = \frac{L_x(\hat{\theta}_H)}{L_x(\hat{\theta})}, \]

where $\hat{\theta}$ : MLE of $\theta$ under $\Omega$ and $\hat{\theta}_H$ : MLE of $\theta$ under $\Omega_H$.

Suppose $\hat{\theta}_H \xrightarrow{P} \theta_H \in \Omega_H$ under $\Omega_H$ and $\hat{\theta} \xrightarrow{P} \theta \in \Omega_H \cup \Omega_K$ under $\Omega$.


Let the LR test be: $\lambda(x) < C_n(\alpha)$,

where $C_n(\alpha)$ is such that the size of the test is $\alpha$, i.e.

\[ \sup_{\theta \in \Omega_H} P_{\theta}\left[\lambda(X) < C_n(\alpha)\right] = \alpha. \]

Under $\theta \in \Omega_H$, $\lambda(x) \xrightarrow{P} 1 \implies C_n(\alpha) \to 1$ as $n \to \infty$.

Also power at $\theta \in \Omega_K$ $= P_{\theta}\left[\lambda(X) < C_n(\alpha)\right]$.

Under $\theta \in \Omega_K$, $\lambda(x)$ converges in probability to a limit $< 1$, and $C_n(\alpha) \to 1$ as $n \to \infty$

$\implies P_{\theta}\left[\lambda(X) < C_n(\alpha)\right] \to 1$ as $n \to \infty$ $\forall\, \theta \in \Omega_K$.

Since power $\to 1$ as $n \to \infty$, the LR test is consistent.


LRT : Applications under Single Distributions 
Module 11 


Saurav De 


Department of Statistics 
Presidency University 


A common notion: The randomised test is applicable only when the parent population is discrete.

It is an incomplete notion!

A randomised test is called for whenever the LR criterion $\lambda(x)$ has a discrete probability distribution, no matter whether the parent population is discrete or continuous.


An example: Let $X_1, X_2 \sim R(\theta, \theta + 1)$ independently. To test $H_0 : \theta = \theta_0$ versus $H_1 : \theta = \theta_1$, $\theta_0 < \theta_1 < \theta_0 + 1$

$\implies \mathcal{X}_0 = \{\theta_0 \le x_i \le \theta_0 + 1,\ i = 1, 2\}$ : sample space under $H_0$
and $\mathcal{X}_1 = \{\theta_1 \le x_i \le \theta_1 + 1,\ i = 1, 2\}$ : sample space under $H_1$.
So

\[ L_{H_0}(\theta) = \begin{cases} 1, & (x_1, x_2) \in \mathcal{X}_0 \\ 0, & (x_1, x_2) \notin \mathcal{X}_0 \end{cases} \]

and

\[ L_{H_1}(\theta) = \begin{cases} 1, & (x_1, x_2) \in \mathcal{X}_1 \\ 0, & (x_1, x_2) \notin \mathcal{X}_1 \end{cases} \]


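Now

\[ \lambda(x) = \frac{L_{H_0}(\theta)}{L_{H_1}(\theta)} = \begin{cases} \infty, & x \in \mathcal{X}_0 \cap \mathcal{X}_1^c \\ 1, & x \in \mathcal{X}_0 \cap \mathcal{X}_1 \\ 0, & x \in \mathcal{X}_0^c \cap \mathcal{X}_1 \end{cases} \]

$\implies P_{H_0}[\lambda(x) = \infty] = P_{H_0}[(X_1, X_2) \in \mathcal{X}_0 \cap \mathcal{X}_1^c] = 1 - (1 - \theta_1 + \theta_0)^2$,

$P_{H_0}[\lambda(x) = 1] = P_{H_0}[(X_1, X_2) \in \mathcal{X}_0 \cap \mathcal{X}_1] = (1 - \theta_1 + \theta_0)^2$

and $P_{H_0}[\lambda(x) = 0] = P_{H_0}[(X_1, X_2) \in \mathcal{X}_0^c \cap \mathcal{X}_1] = 0$.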


$\implies \lambda(x)$ has a discrete probability distribution.
Also the critical region of the LRT is
$W = \{\lambda(x) < c\}$, c to be obtained from the size condition.

Let the given level of significance be $\alpha$. Also

\[ P_{H_0}[\lambda(x) < c] = \begin{cases} P_{H_0}[\lambda(x) = 0] = 0 & \text{if } 0 < c \le 1 \\ P_{H_0}[\lambda(x) \le 1] = (1 - \theta_1 + \theta_0)^2 & \text{if } 1 < c < \infty \end{cases} \]


If $(1 - \theta_1 + \theta_0)^2 = \alpha$, i.e. if $\theta_1 - \theta_0 = 1 - \alpha^{1/2}$, then the CR of level (as well as size) $\alpha$ is

$W = \{\lambda(x) < c\}$, for any $c \in (1, \infty)$.

If $\alpha < (1 - \theta_1 + \theta_0)^2$, call for a randomised test

\[ \phi(x) = \begin{cases} 1, & x \in \mathcal{X}_0^c \cap \mathcal{X}_1 \\ \gamma_{\alpha}, & x \in \mathcal{X}_0 \cap \mathcal{X}_1 \\ 0, & x \in \mathcal{X}_0 \cap \mathcal{X}_1^c \end{cases} \]

where $\gamma_{\alpha}$ is such that $E_{H_0}\, \phi(X) = \alpha \iff \gamma_{\alpha} = \dfrac{\alpha}{(1 - \theta_1 + \theta_0)^2}$.


Aliter: define a CR

$W = \{c \le x_i \le \theta_0 + 1,\ i = 1, 2;\ \theta_0 < c\}$

where c is such that $P_{H_0}(W) = \alpha$.

Now $P_{H_0}(W) = \int_c^{\theta_0 + 1} \int_c^{\theta_0 + 1} dx_1\, dx_2 = (1 - (c - \theta_0))^2$

$\implies (1 - (c - \theta_0))^2 = \alpha$, from the size condition

$\implies$ on simplification $c = 1 + \theta_0 - \sqrt{\alpha}$.
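Both constructions for the $R(\theta, \theta+1)$ example reduce to one-line formulas, sketched below in Python (standard library only). The function names and the values theta0 = 0, theta1 = 0.5, alpha = 0.05 are assumptions chosen for illustration.

```python
from math import sqrt

def randomisation_constant(theta0, theta1, alpha):
    """gamma_alpha = alpha / (1 - theta1 + theta0)^2 for the randomised LR test."""
    overlap = (1 - theta1 + theta0) ** 2      # P_H0[lambda = 1]
    if alpha > overlap:
        raise ValueError("alpha exceeds P_H0[lambda = 1]; no gamma in [0, 1] exists")
    return alpha / overlap

def aliter_cutoff(theta0, alpha):
    """c = 1 + theta0 - sqrt(alpha) for the non-randomised region W."""
    return 1 + theta0 - sqrt(alpha)

# Illustrative values (assumed)
theta0, theta1, alpha = 0.0, 0.5, 0.05
gamma_alpha = randomisation_constant(theta0, theta1, alpha)
c = aliter_cutoff(theta0, alpha)
size_randomised = gamma_alpha * (1 - theta1 + theta0) ** 2   # E_H0 phi(X)
size_aliter = (1 - (c - theta0)) ** 2                        # P_H0(W)
print(gamma_alpha, c, size_randomised, size_aliter)
```

Both tests have size alpha under $H_0$; they differ in how the rejection probability is spread over the sample space.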


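Application 1: Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$. To test H : $\sigma = \sigma_0$ (known) versus (a) $K_1$ : $\sigma \ne \sigma_0$ and (b) $K_2$ : $\sigma > \sigma_0$.
Here $\theta = (\mu, \sigma)$.

$\Omega_H = \{(\mu, \sigma) : \mu \in \mathbb{R},\ \sigma = \sigma_0\}$,

$\Omega_{K_1} = \{(\mu, \sigma) : \mu \in \mathbb{R},\ \sigma > 0,\ \sigma \ne \sigma_0\}$
and $\Omega_{K_2} = \{(\mu, \sigma) : \mu \in \mathbb{R},\ \sigma_0 < \sigma < \infty\}$. The likelihood is

\[ L(\theta) = (2\pi)^{-n/2} (\sigma^2)^{-n/2} \exp\left\{ -\frac{1}{2\sigma^2} \sum (x_i - \mu)^2 \right\} \]

\[ \implies \max_{\theta \in \Omega_H} L(\theta) = (2\pi)^{-n/2} (\sigma_0^2)^{-n/2} \exp\left[ -\frac{1}{2\sigma_0^2} \sum (x_i - \bar{x})^2 \right]. \]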


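Under $\Omega_H$ the MLE of $\mu$ is $\bar{x}$, so

\[ \max_{\theta \in \Omega_H} L(\theta) = (2\pi)^{-n/2} (\sigma_0^2)^{-n/2} \exp\left( -\frac{n s^2}{2\sigma_0^2} \right), \quad s^2 = \frac{1}{n} \sum (x_i - \bar{x})^2 : \text{sample variance.} \]

Again under $\Omega_H \cup \Omega_{K_1} = \{(\mu, \sigma) : \mu \in \mathbb{R},\ \sigma > 0\}$, MLE of $\sigma^2 = s^2$

\[ \implies \max_{\theta \in \Omega_H \cup \Omega_{K_1}} L(\theta) = (2\pi)^{-n/2} (s^2)^{-n/2} \exp\left\{ -\frac{n s^2}{2 s^2} \right\} = \frac{\exp\{-n/2\}}{(2\pi s^2)^{n/2}} \]

$\implies$ in part (a)

\[ \lambda(x) = \frac{\max_{\theta \in \Omega_H} L(\theta)}{\max_{\theta \in \Omega_H \cup \Omega_{K_1}} L(\theta)} = \left( \frac{s^2}{\sigma_0^2} \right)^{n/2} \exp\left\{ -\frac{n}{2} \left( \frac{s^2}{\sigma_0^2} - 1 \right) \right\} = g(t), \]

where $t = \dfrac{s^2}{\sigma_0^2}$ and $g(t) = t^{\nu} \exp(-\nu(t - 1))$, $\nu = n/2$.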


So the CR: $\lambda(x) < c \iff \psi(t) < c_1$

where $\psi(t) = \log g(t) = \nu \log t - \nu(t - 1)$.

Now $\psi'(t) = 0 \implies \dfrac{\nu}{t} - \nu = 0$ or $t = 1$.

Also $\psi''(t)\big|_{t=1} = -\nu < 0$.

So the S.O.C. $\implies \psi(t)$ is maximised (globally) at $t = 1$.

Hence the CR: $\psi(t) < c_1 \iff \{t < k_1 \text{ or } t > k_2\}$


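Equivalently, the CR is

\[ \left\{ \frac{n s^2}{\sigma_0^2} < \lambda_1 \ \text{ or } \ \frac{n s^2}{\sigma_0^2} > \lambda_2 \right\} \]

where $0 < \lambda_1 < \lambda_2$ are such that

\[ P_H\left[ \frac{n s^2}{\sigma_0^2} < \lambda_1 \right] + P_H\left[ \frac{n s^2}{\sigma_0^2} > \lambda_2 \right] = \alpha \quad \text{(size condition).} \]

As $\dfrac{n s^2}{\sigma_0^2} \sim \chi^2_{n-1}$ under H, the assumption of equal tail probability $\implies$

\[ \lambda_1 = \chi^2_{1 - \alpha/2;\, n-1} \quad \text{and} \quad \lambda_2 = \chi^2_{\alpha/2;\, n-1}. \]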


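(b) $\Omega_H \cup \Omega_{K_2} = \{(\mu, \sigma) : \mu \in \mathbb{R},\ \sigma_0 \le \sigma < \infty\}$
$\implies \Omega_H \cup \Omega_{K_2}$ : a restricted parameter space w.r.t. $\sigma$.
Here the ML estimate of $\sigma^2$ is

\[ \hat{\sigma}^2 = \begin{cases} s^2 & \text{if } \sigma_0^2 \le s^2 < \infty \\ \sigma_0^2 & \text{if } s^2 < \sigma_0^2 \end{cases} \]

$\implies$ accordingly

\[ \max_{\theta \in \Omega_H \cup \Omega_{K_2}} L(\theta) = \begin{cases} \dfrac{\exp\{-n/2\}}{(2\pi s^2)^{n/2}} & \text{when } s^2 \ge \sigma_0^2 \\ \dfrac{\exp\left\{ -\frac{n s^2}{2\sigma_0^2} \right\}}{(2\pi \sigma_0^2)^{n/2}} & \text{when } s^2 < \sigma_0^2 \end{cases} \]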


Hence the LR criterion will be

\[ \lambda(x) = \begin{cases} \left( \dfrac{s^2}{\sigma_0^2} \right)^{n/2} \exp\left[ -\dfrac{n}{2} \left( \dfrac{s^2}{\sigma_0^2} - 1 \right) \right] & \text{if } s^2 > \sigma_0^2 \\ 1 & \text{otherwise} \end{cases} \]

As $\lambda(x) = 1$ is the case of trivial acceptance of H,

$\implies$ the CR: $\lambda(x) < c$ should be a subset of $\{s^2 > \sigma_0^2\} = \{t > 1\}$.

Here also $\lambda(x) < c \iff g(t) < c_1 \iff \psi(t) < c_2$, with $t > 1$ now.


As $t = 1$ is the point of maximum for $g(t)$ or $\psi(t)$,
$\implies \psi(t) \downarrow t$ when $t > 1$.

Thus $\psi(t) < c_2 \iff t > k$

$\implies$ the LRT critical region will be $\{t > k\}$ or $\left\{ \dfrac{n s^2}{\sigma_0^2} > \lambda \right\}$,

where $\lambda = \chi^2_{\alpha;\, n-1}$, due to the size-$\alpha$ condition.
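The two-sided region $\{t < k_1 \text{ or } t > k_2\}$ of part (a) comes from a level set of $\psi$, so an exact LR region pairs each $k_1 < 1$ with the $k_2 > 1$ satisfying $\psi(k_1) = \psi(k_2)$; the equal-tail choice in the text is a simplification of this. The sketch below (Python, standard library only) finds that matching $k_2$ by bisection. The helper names and the values k1 = 0.5, nu = 5 are assumptions made for illustration.

```python
from math import log

def psi(t, nu):
    """psi(t) = nu*log(t) - nu*(t - 1), the log of g(t) = t^nu * exp(-nu*(t - 1))."""
    return nu * log(t) - nu * (t - 1)

def matching_upper(k1, nu, hi=1e6):
    """For 0 < k1 < 1, find k2 > 1 with psi(k2) = psi(k1) by bisection.

    Uses the fact that psi is strictly decreasing on (1, infinity).
    """
    target = psi(k1, nu)
    lo = 1.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if psi(mid, nu) > target:
            lo = mid      # still above the target level: move right
        else:
            hi = mid
    return (lo + hi) / 2

# Illustrative values (assumed): nu = n/2 with n = 10
k1, nu = 0.5, 5
k2 = matching_upper(k1, nu)
print("k1 =", k1, "k2 =", k2, "psi values:", psi(k1, nu), psi(k2, nu))
```

In practice $k_1$ would then be tuned so that the two tail probabilities of $\chi^2_{n-1}$ add up to $\alpha$; that step needs chi-square probabilities and is not shown here.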


Application 2: $X_1, \ldots, X_n \overset{iid}{\sim}$ Exponential (mean = $\theta$).
To test H : $\theta = \theta_0$ versus (a) $K_1$ : $\theta \ne \theta_0$, (b) $K_2$ : $\theta < \theta_0$

$\implies$ here $\Omega_H = \{\theta_0\}$, $\Omega_{K_1} = \{\theta : \theta > 0,\ \theta \ne \theta_0\}$ and $\Omega_{K_2} = \{\theta : 0 < \theta < \theta_0\}$.

The likelihood function: $L(\theta) = \dfrac{1}{\theta^n} \exp\left\{ -\dfrac{n\bar{x}}{\theta} \right\}$.


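(a) $\Omega_H \cup \Omega_{K_1} = \{\theta : \theta > 0\}$ : unrestricted parameter space of $\theta$

$\implies$ the MLE of $\theta$ under $\Omega_H \cup \Omega_{K_1}$ is, as usual, $\bar{x}$

\[ \implies \sup_{\theta \in \Omega_H \cup \Omega_{K_1}} L(\theta) = \frac{1}{\bar{x}^n} \exp\left\{ -\frac{n\bar{x}}{\bar{x}} \right\} = \frac{1}{\bar{x}^n} \exp\{-n\}. \]

As a result the LR statistic:

\[ \lambda(x) = \left( \frac{\bar{x}}{\theta_0} \right)^{n} \exp\left\{ -n \left( \frac{\bar{x}}{\theta_0} - 1 \right) \right\} = g(t), \]

where $t = \dfrac{\bar{x}}{\theta_0}$ and $g(t) = t^n \exp\{-n(t - 1)\}$.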


So the CR:
$\lambda(x) < c \iff \psi(t) < c_1$
where $\psi(t) = \ln g(t) = n \ln t - n(t - 1)$.

Here also $\psi(t)$ gets maximised globally at $t = 1$.

Thus CR: $\psi(t) < c_1 \iff \{t < k_1\} \cup \{t > k_2\}$, or equivalently

\[ \left\{ \frac{2\sum x_i}{\theta_0} < \lambda_1 \ \text{ or } \ \frac{2\sum x_i}{\theta_0} > \lambda_2 \right\}. \]


As the $X_i$'s are iid exponential (mean = $\theta$), $\dfrac{2\sum X_i}{\theta_0} \sim \chi^2_{2n}$ under H

$\implies$ the LRT CR: $\{\chi^2_{2n} < \lambda_1 \text{ or } \chi^2_{2n} > \lambda_2\}$ where $0 < \lambda_1 < \lambda_2$ are such that
$P_H[\chi^2_{2n} \le \lambda_1] + P_H[\chi^2_{2n} > \lambda_2] = \alpha$ (size-$\alpha$ condition).

Finally assume equal tail probability for simplicity and get

\[ \lambda_1 = \chi^2_{1 - \alpha/2;\, 2n} \quad \text{and} \quad \lambda_2 = \chi^2_{\alpha/2;\, 2n}. \]


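(b) $\Omega_H \cup \Omega_{K_2} = \{\theta : 0 < \theta \le \theta_0\}$ : restricted parameter space of $\theta$.
Thus the MLE of $\theta$ is

\[ \hat{\theta} = \begin{cases} \bar{x}, & 0 < \bar{x} \le \theta_0 \\ \theta_0, & \bar{x} > \theta_0 \end{cases} \]

Accordingly

\[ \sup_{\theta \in \Omega_H \cup \Omega_{K_2}} L(\theta) = \begin{cases} \dfrac{1}{\bar{x}^n} \exp\{-n\}, & 0 < \bar{x} \le \theta_0 \\ \dfrac{1}{\theta_0^n} \exp\left\{ -\dfrac{n\bar{x}}{\theta_0} \right\}, & \bar{x} > \theta_0 \end{cases} \]

\[ \implies \lambda(x) = \begin{cases} \left( \dfrac{\bar{x}}{\theta_0} \right)^{n} \exp\left\{ -n \left( \dfrac{\bar{x}}{\theta_0} - 1 \right) \right\}, & 0 < \bar{x} \le \theta_0 \\ 1, & \bar{x} > \theta_0 \end{cases} \]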


But $\{\lambda(x) = 1\}$ or $\{\bar{x} > \theta_0\}$ is the region of trivial acceptance of H

$\implies$ the CR: $\{\lambda(x) < c\} \subset \{x : 0 < \bar{x} \le \theta_0\}$ or $\{t \le 1\}$, where $t = \dfrac{\bar{x}}{\theta_0}$

$\implies$ as earlier, here also
$\lambda(x) < c \iff g(t) < c_1 \iff \psi(t) < c_2$.

Moreover now $\psi(t) \uparrow t\ \ \forall\, t < 1$.

So the CR: $\{\psi(t) < c_2\} \iff \{t < k\} \iff \left\{ \dfrac{2\sum x_i}{\theta_0} < \lambda \right\}$,

where $\lambda = \chi^2_{1 - \alpha;\, 2n}$, due to the size-$\alpha$ condition.


In this module, the two applications discussed so far are based on distributions belonging to the OPEF (one-parameter exponential family). Now we will consider distributions which are not members of the OPEF.

Application 3. Let $X_1, \ldots, X_n$ be iid with common pdf $f_{\theta}(x) = \dfrac{\theta}{x^2}$, $x > \theta$; $\theta > 0$.
Suppose we are to test H : $\theta \le \theta_0$ versus K : $\theta > \theta_0$.

Here $\Omega_H = \{\theta : 0 < \theta \le \theta_0\}$, a restricted parameter space, and
$\Omega_H \cup \Omega_K = \{\theta : 0 < \theta < \infty\}$, the unrestricted parameter space.

The likelihood function of $\theta$ is $L(\theta) = \dfrac{\theta^n}{\left( \prod x_i \right)^2}$; $0 < \theta \le x_{(1)} \le \cdots \le x_{(n)}$,

\[ \text{i.e. } L(\theta) = \begin{cases} \dfrac{\theta^n}{\left( \prod x_i \right)^2}, & 0 < \theta \le x_{(1)} \\ 0, & \text{otherwise} \end{cases} \]


$\implies L(\theta) \uparrow \theta$ for $0 < \theta \le x_{(1)}$.
Hence $L(\theta)$ attains its maximum at $\theta = x_{(1)} = \hat{\theta}$ (say)

\[ \implies \sup_{\Omega_H \cup \Omega_K} L(\theta) = \frac{\hat{\theta}^n}{\left( \prod x_i \right)^2}. \]

Also under $\Omega_H$, the ML estimate of $\theta$ is

\[ \hat{\theta}_H = \begin{cases} \hat{\theta}, & \hat{\theta} \le \theta_0 \\ \theta_0, & \hat{\theta} > \theta_0 \end{cases} \]
So the LR criterion is

\[ \lambda(x) = \begin{cases} 1, & \hat{\theta} \le \theta_0 \\ \left( \dfrac{\theta_0}{\hat{\theta}} \right)^{n}, & \hat{\theta} > \theta_0 \end{cases} \]

$\implies \lambda(x) \downarrow \hat{\theta}$

$\implies$ the LR test CR: $\lambda(x) < \lambda_0 \iff \hat{\theta} > c$, where c is such that

\[ \sup_{\Omega_H} P_{\theta}[\hat{\theta} > c] = \alpha, \]

where $\alpha$ : given level of significance, i.e.

\[ \sup_{\theta \le \theta_0} P_{\theta}[X_{(1)} > c] = \alpha. \]


Now $P_{\theta}[X_{(1)} > c] = \left( P_{\theta}[X_1 > c] \right)^n = \left( \dfrac{\theta}{c} \right)^{n} \uparrow \theta$

$\implies \left( \dfrac{\theta_0}{c} \right)^{n} = \alpha$, i.e. $c = \dfrac{\theta_0}{\alpha^{1/n}}$, and hence

the CR under the LRT is $\left\{ X_{(1)} > \dfrac{\theta_0}{\alpha^{1/n}} \right\}$.
Application 4. Suppose a single observation $X \sim$ Cauchy$(0, \sigma)$ distribution with pdf

\[ f(x) = \frac{\sigma}{\pi \{\sigma^2 + x^2\}}, \quad -\infty < x < \infty;\ \sigma > 0. \]

Based on this observation only, suppose we are to test H : $\sigma \le \sigma_0$ versus K : $\sigma > \sigma_0$.

Here $\Omega_H = \{\sigma : 0 < \sigma \le \sigma_0\}$ : a restricted parameter space, and
$\Omega_H \cup \Omega_K = \{\sigma : 0 < \sigma < \infty\}$ : unrestricted parameter space.

The likelihood function of $\sigma$ is $L(\sigma) = \dfrac{\sigma}{\pi \{\sigma^2 + x^2\}}$, $0 < \sigma < \infty$
$\implies l(\sigma) = \ln L(\sigma) = \ln \sigma - \ln \pi - \ln \{\sigma^2 + x^2\}$.

Hence one will get that under $\Omega_H \cup \Omega_K$, $\hat{\sigma} = |x|$. (Try yourself)


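$\implies \sup_{\Omega_H \cup \Omega_K} L(\sigma) = \dfrac{1}{2\pi |x|}$ (putting $\sigma = |x|$ in $L(\sigma)$).

Also under $\Omega_H$

\[ \hat{\sigma}_H = \begin{cases} |x|, & |x| \le \sigma_0 \\ \sigma_0, & |x| > \sigma_0 \end{cases} \]

\[ \implies \sup_{\Omega_H} L(\sigma) = \begin{cases} \dfrac{1}{2\pi |x|}, & |x| \le \sigma_0 \\ \dfrac{\sigma_0}{\pi \{\sigma_0^2 + x^2\}}, & |x| > \sigma_0 \end{cases} \]

So the LR criterion will be

\[ \lambda(x) = \begin{cases} 1, & |x| \le \sigma_0 \\ \dfrac{2\sigma_0 |x|}{\sigma_0^2 + x^2}, & |x| > \sigma_0 \end{cases} \]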


Thus $\lambda(x) \downarrow |x|$.
So the CR: $\lambda(x) < \lambda_0 \iff |x| > c$, where c is such that

\[ \sup_{\sigma \le \sigma_0} P_{\sigma}[|X| > c] = \alpha \ \text{(known significance level)} \quad \ldots (*) \]

Now

\[ P_{\sigma}[|X| \le c] = \int_{-c}^{c} \frac{\sigma}{\pi \{\sigma^2 + x^2\}}\, dx = \frac{2}{\pi} \tan^{-1}\left( \frac{c}{\sigma} \right) \]

$\implies P_{\sigma}[|X| > c] = 1 - \dfrac{2}{\pi} \tan^{-1}\left( \dfrac{c}{\sigma} \right) \uparrow \sigma$.


\[ \implies \sup_{\sigma \le \sigma_0} P_{\sigma}[|X| > c] = P_{\sigma_0}[|X| > c] = 1 - \frac{2}{\pi} \tan^{-1}\left( \frac{c}{\sigma_0} \right), \]

hence from (*) we get $1 - \dfrac{2}{\pi} \tan^{-1}\left( \dfrac{c}{\sigma_0} \right) = \alpha$

or $\tan^{-1}\left( \dfrac{c}{\sigma_0} \right) = (1 - \alpha)\dfrac{\pi}{2}$

i.e. $c = \sigma_0 \tan\left[ (1 - \alpha)\dfrac{\pi}{2} \right]$.

So the CR for the LRT is $\left\{ |X| > \sigma_0 \tan\left[ (1 - \alpha)\dfrac{\pi}{2} \right] \right\}$.
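The cut-off for the one-sided Cauchy scale test is a closed-form expression, so it is easy to evaluate and to check against the size condition (*). The sketch below uses Python's standard `math` module; the function names and the values sigma0 = 1, alpha = 0.05 are assumptions made for illustration.

```python
from math import atan, pi, tan

def cauchy_one_sided_cutoff(sigma0, alpha):
    """c = sigma0 * tan((1 - alpha) * pi / 2) for the CR {|X| > c}."""
    return sigma0 * tan((1 - alpha) * pi / 2)

def tail_prob(c, sigma):
    """P_sigma[|X| > c] = 1 - (2/pi) * arctan(c / sigma) for X ~ Cauchy(0, sigma)."""
    return 1 - (2 / pi) * atan(c / sigma)

# Illustrative values (assumed)
sigma0, alpha = 1.0, 0.05
c = cauchy_one_sided_cutoff(sigma0, alpha)
print("c =", c, "size at sigma0 =", tail_prob(c, sigma0))
print("rejection probability at sigma = 0.5:", tail_prob(c, 0.5))
```

The second printed value is smaller than alpha, consistent with $P_{\sigma}[|X| > c]$ increasing in $\sigma$.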


TRY YOURSELF!

11.1. Suppose a single observation is drawn from a Cauchy$(0, \sigma)$ population. Discuss an LRT for testing H : $\sigma = \sigma_0$ versus K : $\sigma \ne \sigma_0$. If the observation is $-0.2978$ and $\sigma_0 = 1$, what will be your decision about acceptance or rejection of H?

11.2. Let $X_1, \ldots, X_n$ be iid with common pdf $f_{\theta}(x) = \dfrac{\theta}{x^2}$, $x \ge \theta$; $\theta > 0$. Suppose we are to test H : $\theta = \theta_0$ versus K : $\theta \ne \theta_0$. Discuss an LRT for testing these hypotheses.

(Readers can solve it directly if they go through Application 3 of this module.)


TUTORIAL DISCUSSION:

Overview of the problems from MODULE 11

M11.1. The basic set-up of this problem is the same as that in Application 4. So we will use the expressions and findings of that application directly in this problem as and when necessary.

Here $\Omega_H = \{\sigma = \sigma_0\}$, a singleton set, and $\Omega_H \cup \Omega_K = \{\sigma : 0 < \sigma < \infty\}$, the unrestricted parameter space of $\sigma$

\[ \implies \sup_{\Omega_H} L(\sigma) = \frac{\sigma_0}{\pi \{\sigma_0^2 + x^2\}} \]

and

\[ \sup_{\Omega_H \cup \Omega_K} L(\sigma) = \frac{1}{2\pi |x|} \]

(since the MLE of $\sigma$ under $\Omega_H \cup \Omega_K$ is $|x|$, using it in $L(\sigma)$ we get the supremum of the likelihood as above).


So the LR criterion: $\lambda(x) = \dfrac{2\sigma_0 |x|}{\sigma_0^2 + x^2} = g(t)$, say, where

\[ g(t) = \frac{2\sigma_0 t}{\sigma_0^2 + t^2} \quad \text{and} \quad t = |x|. \]

Now check that $g(t)$ is globally maximised at $t = \sigma_0$, and the plot of t versus $g(t)$ appears as follows:

[Figure: plot of g(t) versus t]


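Hence the LR critical region $\lambda(x)\ (= g(t)) < \lambda_0 \iff \{T < c_1 \text{ or } T > c_2\}$,
where $T = |X|$ and $c_1 < c_2$ are such that

$g(c_1) = g(c_2)$ and $P_{\sigma_0}[T < c_1] + P_{\sigma_0}[T > c_2] = \alpha$ (given significance level).

Now $g(c_1) = g(c_2) \implies \dfrac{2\sigma_0 c_1}{\sigma_0^2 + c_1^2} = \dfrac{2\sigma_0 c_2}{\sigma_0^2 + c_2^2}$,

which on simplification finally gives $c_1 c_2 = \sigma_0^2$ (as $c_1 \ne c_2$), i.e. $c_2 = \dfrac{\sigma_0^2}{c_1}$

$\implies P_{\sigma_0}[T < c_1] + P_{\sigma_0}\left[ T > \dfrac{\sigma_0^2}{c_1} \right] = \alpha$

\[ \implies \frac{2}{\pi} \tan^{-1}\left( \frac{c_1}{\sigma_0} \right) + 1 - \frac{2}{\pi} \tan^{-1}\left( \frac{\sigma_0}{c_1} \right) = \alpha \]

(as we already know that $P_{\sigma}[T < c] = \dfrac{2}{\pi} \tan^{-1}\left( \dfrac{c}{\sigma} \right)$).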


On simplification the relation becomes

\[ \frac{2}{\pi} \left[ \tan^{-1}\left( \frac{\sigma_0}{c_1} \right) - \tan^{-1}\left( \frac{c_1}{\sigma_0} \right) \right] = 1 - \alpha \]

\[ \implies \tan^{-1}\left( \frac{\sigma_0^2 - c_1^2}{2 c_1 \sigma_0} \right) = (1 - \alpha)\frac{\pi}{2} \]

$\implies \sigma_0^2 - c_1^2 = 2\sigma_0 \gamma c_1$

where $\gamma = \tan\left[ (1 - \alpha)\dfrac{\pi}{2} \right]$ is known for given $\alpha$. So finally we have

\[ c_1^2 + 2\sigma_0 \gamma c_1 - \sigma_0^2 = 0 \quad \ldots (**) \]

The solution of (**) considered feasible for this problem gives the cut-off points of the test.
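Equation (**) is a quadratic in $c_1$ whose only positive root is $c_1 = -\sigma_0\gamma + \sigma_0\sqrt{\gamma^2 + 1}$, with $c_2 = \sigma_0^2 / c_1$. The sketch below (Python, standard library only) evaluates these and checks the size condition; it is an independent illustration, and the function names are assumptions. The values sigma0 = 1 and alpha = 0.05 are the ones used in the module's own computation.

```python
from math import atan, pi, sqrt, tan

def cauchy_two_sided_cutoffs(sigma0, alpha):
    """Positive root c1 of c1^2 + 2*sigma0*gamma*c1 - sigma0^2 = 0, and c2 = sigma0^2/c1."""
    gamma = tan((1 - alpha) * pi / 2)
    c1 = -sigma0 * gamma + sigma0 * sqrt(gamma * gamma + 1)
    c2 = sigma0 * sigma0 / c1
    return c1, c2

def two_sided_size(c1, c2, sigma0):
    """P[T < c1] + P[T > c2] for T = |X|, X ~ Cauchy(0, sigma0)."""
    return (2 / pi) * atan(c1 / sigma0) + 1 - (2 / pi) * atan(c2 / sigma0)

c1, c2 = cauchy_two_sided_cutoffs(1.0, 0.05)
size = two_sided_size(c1, c2, 1.0)
print("c1 =", c1, "c2 =", c2, "size =", size)
```

H would be rejected at the chosen level when the observed $|x|$ falls below c1 or above c2.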


For the second part of this problem, we solve (**) for $c_1$, and hence for $c_2$ also, using R programming under the choice $\sigma_0 = 1$ and $\alpha = 0.05, 0.1$. Finally, in the light of the given value $x = -2.9748$, we will decide about acceptance or rejection of H for both the levels.


Computation using R : (to find 'c1' and 'c2')
R Code and Output :

> iowa=function(sigma,alpha)
{
# gamma = tan((1 - alpha) * pi/2), as defined for equation (**)
gamma=tan((1-alpha)*(0.5*pi))
# coefficients of the quadratic c1^2 + 2*sigma*gamma*c1 - sigma^2 = 0
a=1
b=2*gamma*sigma
c=-(sigma*sigma)
del=(b*b)-(4*a*c)
# the two roots for c1, and the matching c2 = sigma^2/c1
c11=((-b)+sqrt(del))/(2*a)
c12=((-b)-sqrt(del))/(2*a)
c21=(sigma*sigma)/c11
c22=(sigma*sigma)/c12
# print the feasible (positive) pair
if (c11>0)
{
cat(c11,"\n",c21)
cat("\n\n")
}
else if (c12>0)
{
cat(c12,"\n",c22)
cat("\n\n")
}
}
> # values of 'c1' and 'c2' for different values of 'alpha'
> iowa(1,0.05)
0.03929011
 25.4517
> iowa(1,0.1)
0.07870171
 12.7062


LRT : Applications under Two Comparable Single 
Distributions and Bivariate Distribution 
Module 12 


Saurav De 


Department of Statistics 
Presidency University 


Application 1: $X_1, \ldots, X_m \overset{iid}{\sim} N(\mu_1, \sigma_1^2)$ and $Y_1, \ldots, Y_n \overset{iid}{\sim} N(\mu_2, \sigma_2^2)$.

Suppose we want to test H : $\sigma_1 = \sigma_2$ versus
(a) $K_1$ : $\sigma_1 \ne \sigma_2$, (b) $K_2$ : $\sigma_2 > \sigma_1$.

Here $\theta = (\mu_1, \mu_2, \sigma_1, \sigma_2)$.
$\Omega_H = \{\theta : \mu_1, \mu_2 \in \mathbb{R},\ \sigma_2 = \sigma_1 = \sigma,\ \sigma > 0\}$,

$\Omega_H \cup \Omega_{K_1} = \{\theta : \mu_1, \mu_2 \in \mathbb{R},\ \sigma_1, \sigma_2 > 0\}$ : fully unrestricted parameter space, and

$\Omega_H \cup \Omega_{K_2} = \{\theta : \mu_1, \mu_2 \in \mathbb{R},\ \sigma_2 \ge \sigma_1 > 0\}$ : partially restricted parameter space w.r.t. $\sigma_1, \sigma_2$.
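The likelihood under $\Omega_H \cup \Omega_{K_1}$:

\[ L(\theta) = C(\sigma_1^2, \sigma_2^2) \exp\left\{ -\frac{1}{2} \left[ \frac{\sum (x_i - \mu_1)^2}{\sigma_1^2} + \frac{\sum (y_j - \mu_2)^2}{\sigma_2^2} \right] \right\} \]

\[ = C(\sigma_1^2, \sigma_2^2) \exp\left\{ -\frac{1}{2} \left[ \frac{m s_1^2 + m(\bar{x} - \mu_1)^2}{\sigma_1^2} + \frac{n s_2^2 + n(\bar{y} - \mu_2)^2}{\sigma_2^2} \right] \right\} \]

where $C(\sigma_1^2, \sigma_2^2) = \dfrac{1}{(\sqrt{2\pi})^{m+n} \sigma_1^m \sigma_2^n}$, and $s_1^2$ and $s_2^2$ are the sample variances of X and Y with divisors m and n respectively.

(a) Under $\Omega_H \cup \Omega_{K_1}$, the MLE of $(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2)$ is $(\bar{x}, \bar{y}, s_1^2, s_2^2)$

\[ \implies \max_{\theta \in \Omega_H \cup \Omega_{K_1}} L(\theta) = \frac{1}{(\sqrt{2\pi})^{m+n} (s_1^2)^{m/2} (s_2^2)^{n/2}} \exp\left\{ -\frac{m + n}{2} \right\}. \]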


Also under $\Omega_H$

\[ L(\theta) = \frac{1}{(\sqrt{2\pi})^{m+n} (\sigma^2)^{\frac{m+n}{2}}} \exp\left\{ -\frac{1}{2\sigma^2} \left[ m s_1^2 + m(\bar{x} - \mu_1)^2 + n s_2^2 + n(\bar{y} - \mu_2)^2 \right] \right\} \]

$\implies$ MLE of $\sigma^2$ will be $\hat{\sigma}^2 = \dfrac{m s_1^2 + n s_2^2}{m + n}$ : pooled sample variance

\[ \implies \max_{\theta \in \Omega_H} L(\theta) = \frac{1}{(\sqrt{2\pi})^{m+n} (\hat{\sigma}^2)^{\frac{m+n}{2}}} \exp\left\{ -\frac{m + n}{2} \right\}. \]

So the LR criterion:

\[ \lambda(x) = \frac{(s_1^2)^{m/2} (s_2^2)^{n/2}}{(\hat{\sigma}^2)^{\frac{m+n}{2}}} = \frac{t^{n/2} (m + n)^{\frac{m+n}{2}}}{(nt + m)^{\frac{m+n}{2}}}, \]

where $t = \dfrac{s_2^2}{s_1^2}$.


Now

\[ \lambda(x) < c \iff g(t) = \frac{t^{n/2}}{(nt + m)^{\frac{m+n}{2}}} < c_1 \iff \psi(t) < c_2 \]

where $\psi(t) = \ln g(t) = \dfrac{n}{2} \ln t - \dfrac{m + n}{2} \ln(m + nt)$.

$\psi'(t) = 0 \implies \dfrac{n}{2t} - \dfrac{n(m + n)}{2(m + nt)} = 0 \implies t = 1$ (on simplification).

Also here $\psi''(t)\big|_{t=1} < 0$

$\implies \psi(t)$ has a global maximum at $t = 1$.


Thus $\psi(t) < c_2 \iff \{t < k_1 \text{ or } t > k_2\}$, $k_1 < k_2$

$\iff \{F_H < \lambda_1 \text{ or } F_H > \lambda_2\}$, $\lambda_1 < \lambda_2$,

where $F_H = \dfrac{n s_2^2 / (n - 1)}{m s_1^2 / (m - 1)} \sim F_{n-1,\, m-1}$ under H, and $\lambda_1, \lambda_2$ are such that

\[ P_H[F_H < \lambda_1] + P_H[F_H > \lambda_2] = \alpha \ \text{(size-}\alpha\text{ condition)} \quad \ldots (*) \]

and $\psi(k_1) = \psi(k_2)$ or $g(k_1) = g(k_2)$, i.e.

\[ \frac{k_1^{n/2}}{(n k_1 + m)^{\frac{m+n}{2}}} = \frac{k_2^{n/2}}{(n k_2 + m)^{\frac{m+n}{2}}} \quad \ldots (**) \]

The equal tail probability assumption does not need (**) and simplifies the solution of (*) with $\lambda_1 = F_{1 - \alpha/2;\, n-1,\, m-1}$ and $\lambda_2 = F_{\alpha/2;\, n-1,\, m-1}$, though the test then no longer remains an LRT.


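(b) Under $\Omega_H \cup \Omega_{K_2} = \left\{ \theta : \mu_1, \mu_2 \in \mathbb{R},\ \dfrac{\sigma_2^2}{\sigma_1^2} \ge 1 \right\}$,

\[ \text{MLE of } \frac{\sigma_2^2}{\sigma_1^2} = \begin{cases} \dfrac{s_2^2}{s_1^2}, & \dfrac{s_2^2}{s_1^2} \ge 1 \\ 1, & \dfrac{s_2^2}{s_1^2} < 1 \end{cases} \]

As $L(\theta)$ under $\Omega_H \cup \Omega_{K_2}$ has the same form as $L(\theta)$ under $\Omega_H \cup \Omega_{K_1}$, this implies

\[ \max_{\theta \in \Omega_H \cup \Omega_{K_2}} L(\theta) = \begin{cases} \dfrac{1}{(\sqrt{2\pi})^{m+n} (s_1^2)^{m/2} (s_2^2)^{n/2}} \exp\left\{ -\dfrac{m + n}{2} \right\}, & \dfrac{s_2^2}{s_1^2} \ge 1 \\ \dfrac{1}{(\sqrt{2\pi})^{m+n} (\hat{\sigma}^2)^{\frac{m+n}{2}}} \exp\left\{ -\dfrac{m + n}{2} \right\}, & \dfrac{s_2^2}{s_1^2} < 1 \end{cases} \]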


Accordingly

\[ \lambda(x) = \begin{cases} \dfrac{(s_1^2)^{m/2} (s_2^2)^{n/2}}{(\hat{\sigma}^2)^{\frac{m+n}{2}}}, & \dfrac{s_2^2}{s_1^2} \ge 1 \\ 1, & \dfrac{s_2^2}{s_1^2} < 1 \quad \text{(case of trivial acceptance of H)} \end{cases} \]

Thus the CR: $\{\lambda(x) < c\} \subset \{t \ge 1\}$, $t = \dfrac{s_2^2}{s_1^2}$.

Now $\lambda(x) < c \iff g(t) = \dfrac{t^{n/2}}{(nt + m)^{\frac{m+n}{2}}} < c_1 \iff \psi(t) = \ln g(t) < c_2$.


But $\psi(t)$ is maximum at $t = 1$ and $\psi(t) \downarrow t$ for $t > 1$.

Thus $\psi(t) < c_2 \iff \{t > k\} \iff \left\{ \dfrac{s_2^2}{s_1^2} > k \right\}$,

i.e. the LRT CR: $\{F_H > \lambda\}$, $F_H = \dfrac{n s_2^2 / (n - 1)}{m s_1^2 / (m - 1)} \sim F_{n-1,\, m-1}$ under H.

The size-$\alpha$ condition gives $\lambda = F_{\alpha;\, n-1,\, m-1}$.

Note: H is accepted $\implies$ homoscedasticity, i.e. equal variance, holds.


LRT : Applications under Two @Qothshata 
Comparable Single Distributions 
and Bivariate Distribution 


Second Part Now, given homoscedasticity, consider the following 
hypotheses to test: 


H : wo — pı =O versus (a) Kı : y2 — pi Z0, (b) Ko: y2 — pı > 0 
= here the parameter 0 = (11, 112,0?) , à? : the common variance. 
Qy = (0 : pi, u2 € R, uo = pf, 0 > OF 


QyUOK, = (0:1, u2 € R,o > 0) and 


QH UNK = (0: —oo < i1 € u2 < œ,0 > 0} — restricted parameter 
space w.r.t. (u1, u2). 


The likelihood under $\Omega_H \cup \Omega_{K_1}$ :

\[ L(\theta) = \frac{1}{(2\pi\sigma^2)^{(m+n)/2}} \exp\left[ -\frac{1}{2\sigma^2} \left\{ m s_1^2 + m(\bar x - \mu_1)^2 + n s_2^2 + n(\bar y - \mu_2)^2 \right\} \right] \]

We get the MLE of $(\mu_1, \mu_2, \sigma^2)$ as $(\bar x, \bar y, \hat\sigma^2)$, $\hat\sigma^2 = \dfrac{m s_1^2 + n s_2^2}{m+n}$.

\[ \implies \max_{\theta \in \Omega_H \cup \Omega_{K_1}} L(\theta) = \frac{1}{(2\pi\hat\sigma^2)^{(m+n)/2}} \exp\left\{ -\frac{m+n}{2} \right\} \]


Under $\Omega_H$ : let $\mu_1 = \mu_2 = \mu$.

\[ \implies L(\theta) = \frac{1}{(2\pi\sigma^2)^{(m+n)/2}} \exp\left[ -\frac{1}{2\sigma^2} \left\{ m s_1^2 + m(\bar x - \mu)^2 + n s_2^2 + n(\bar y - \mu)^2 \right\} \right] \]

Under $\Omega_H$ the MLE of $\mu$ is $\hat\mu = \dfrac{m\bar x + n\bar y}{m+n}$ and the MLE of $\sigma^2$ is

\[ \hat{\hat\sigma}^2 = \frac{m s_1^2 + n s_2^2 + m d_1^2 + n d_2^2}{m+n}, \]

where $d_1 = \bar x - \hat\mu$, $d_2 = \bar y - \hat\mu$.

\[ \implies \max_{\theta \in \Omega_H} L(\theta) = \frac{1}{(2\pi\hat{\hat\sigma}^2)^{(m+n)/2}} \exp\left\{ -\frac{m+n}{2} \right\} \]


\[ \implies \lambda(x) = \left( \frac{\hat\sigma^2}{\hat{\hat\sigma}^2} \right)^{(m+n)/2}, \quad \hat\sigma^2 = \frac{m s_1^2 + n s_2^2}{m+n}, \quad \hat{\hat\sigma}^2 = \frac{m s_1^2 + n s_2^2 + m d_1^2 + n d_2^2}{m+n}. \]

The CR: $\lambda(x) < c \iff \dfrac{m s_1^2 + n s_2^2}{m s_1^2 + n s_2^2 + m d_1^2 + n d_2^2} < c_1$

\[ \iff 1 + \frac{m d_1^2 + n d_2^2}{m s_1^2 + n s_2^2} > c_2. \]


On simplification it becomes

\[ \frac{(\bar y - \bar x)^2}{s^2 \left( \frac1m + \frac1n \right)} > \lambda^2, \qquad s^2 = \frac{m s_1^2 + n s_2^2}{m+n-2}, \]

or $|t_H| > \lambda$, where

\[ t_H = \frac{\bar y - \bar x}{s \sqrt{\frac1m + \frac1n}} \sim t_{m+n-2} \text{ under } H. \]

The critical point is $\lambda = t_{\alpha/2;\,m+n-2}$ at level $\alpha$.
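A short R sketch may help connect the LR criterion with the pooled $t$ statistic. The data below are simulated (hypothetical), $s_1^2, s_2^2$ use divisors $m$ and $n$ as in the text, and the identity checked at the end, $\lambda^{2/(m+n)} = 1/\{1 + t_H^2/(m+n-2)\}$, follows from $m d_1^2 + n d_2^2 = \frac{mn}{m+n}(\bar y - \bar x)^2$.

```r
# Hypothetical two-sample data with a common variance
set.seed(2)
x <- rnorm(10, mean = 5, sd = 2)
y <- rnorm(14, mean = 6, sd = 2)
m <- length(x); n <- length(y)
alpha <- 0.05

# Sample variances with divisors m and n (the MLE versions used in the text)
s1sq <- mean((x - mean(x))^2)
s2sq <- mean((y - mean(y))^2)

# Pooled variance with divisor m + n - 2 and the t statistic
s_sq <- (m * s1sq + n * s2sq) / (m + n - 2)
t_H  <- (mean(y) - mean(x)) / sqrt(s_sq * (1 / m + 1 / n))

# LR criterion from the two variance MLEs
sig_full <- (m * s1sq + n * s2sq) / (m + n)             # MLE under the full space
mu_hat   <- (m * mean(x) + n * mean(y)) / (m + n)       # MLE of mu under H
sig_null <- sig_full +
  (m * (mean(x) - mu_hat)^2 + n * (mean(y) - mu_hat)^2) / (m + n)
lambda_LR <- (sig_full / sig_null)^((m + n) / 2)

# Critical points: two-sided (a) and one-sided (b)
crit_a <- qt(1 - alpha / 2, df = m + n - 2)
crit_b <- qt(1 - alpha, df = m + n - 2)
c(t_H = t_H, lambda_LR = lambda_LR,
  reject_a = abs(t_H) > crit_a, reject_b = t_H > crit_b)
```

Because $\lambda$ is a decreasing function of $|t_H|$, rejecting for small $\lambda$ and rejecting for large $|t_H|$ are the same rule.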


(b) Under $\Omega_H \cup \Omega_{K_2}$, where $0 \le \mu_2 - \mu_1 < \infty$,

\[ \text{MLE of } \mu_2 - \mu_1 = \begin{cases} \bar y - \bar x, & \bar y - \bar x \ge 0 \\ 0, & \text{otherwise.} \end{cases} \]

Thus

\[ \max_{\theta \in \Omega_H \cup \Omega_{K_2}} L(\theta) = \begin{cases} \dfrac{1}{(2\pi\hat\sigma^2)^{(m+n)/2}} \exp\left\{ -\dfrac{m+n}{2} \right\}, & \bar y - \bar x \ge 0 \\[2ex] \dfrac{1}{(2\pi\hat{\hat\sigma}^2)^{(m+n)/2}} \exp\left\{ -\dfrac{m+n}{2} \right\}, & \bar y - \bar x < 0. \end{cases} \]


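Accordingly

\[ \lambda(x) = \begin{cases} \left( \dfrac{\hat\sigma^2}{\hat{\hat\sigma}^2} \right)^{(m+n)/2}, & \bar y - \bar x \ge 0 \\[2ex] 1, & \bar y - \bar x < 0 \quad (\text{case when } H \text{ is trivially accepted}). \end{cases} \]

$\implies$ CR: $\lambda(x) < c \subset \{\bar y - \bar x > 0\}$.

Now $\lambda(x) < c \iff \dfrac{\hat\sigma^2}{\hat{\hat\sigma}^2} < c_1 \implies$ on simplification

\[ \frac{\bar y - \bar x}{s \sqrt{\frac1m + \frac1n}} > \lambda \quad [\text{as here } \bar y - \bar x > 0], \qquad s^2 = \frac{m s_1^2 + n s_2^2}{m+n-2}. \]

Size-$\alpha$ condition $\implies \lambda = t_{\alpha;\,m+n-2}$.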


Application 2: Let $X_1, \ldots, X_m$ be iid with common pdf $\dfrac{\theta_1}{x^2}$, $x \ge \theta_1 > 0$, and $Y_1, \ldots, Y_n$ be iid with common pdf $\dfrac{\theta_2}{y^2}$, $y \ge \theta_2 > 0$, independently.

Suppose we are to test $H : \theta_1 = \theta_2$ against $K : \theta_1 \ne \theta_2$.

Here $\Omega_H = \{\theta : \theta_1 = \theta_2 = \theta;\ 0 < \theta < \infty\}$ and
$\Omega_H \cup \Omega_K = \{(\theta_1, \theta_2) : 0 < \theta_i < \infty,\ i = 1, 2\}$.

Under $\Omega_H \cup \Omega_K$ the likelihood function is

\[ L(\theta_1, \theta_2) = \frac{\theta_1^m \theta_2^n}{\prod_{i=1}^m x_i^2 \prod_{j=1}^n y_j^2}; \quad \theta_1 \le x_{(1)},\ \theta_2 \le y_{(1)}. \]

Obviously $L(\theta_1, \theta_2)$ is increasing in $\theta_i$, $i = 1, 2$.


Hence the MLE of $(\theta_1, \theta_2)$ is $(X_{(1)}, Y_{(1)})$.

\[ \implies \sup_{\Omega_H \cup \Omega_K} L(\theta_1, \theta_2) = \frac{x_{(1)}^m\, y_{(1)}^n}{\prod_{i=1}^m x_i^2 \prod_{j=1}^n y_j^2}. \]

Similarly, under $\Omega_H$ the likelihood function is

\[ L(\theta) = \frac{\theta^{m+n}}{\prod_{i=1}^m x_i^2 \prod_{j=1}^n y_j^2}; \quad \theta \le x_{(1)},\ \theta \le y_{(1)}, \text{ i.e. } \theta \le \min\{x_{(1)}, y_{(1)}\}. \]

Obviously $L(\theta)$ is increasing in $\theta$, and hence the MLE of $\theta$ is $\min\{X_{(1)}, Y_{(1)}\}$.


\[ \implies \sup_{\Omega_H} L(\theta) = \frac{\left( \min\{x_{(1)}, y_{(1)}\} \right)^{m+n}}{\prod_{i=1}^m x_i^2 \prod_{j=1}^n y_j^2}. \]

So the LR criterion is

\[ \lambda(x, y) = \frac{\sup_{\Omega_H} L(\theta)}{\sup_{\Omega_H \cup \Omega_K} L(\theta_1, \theta_2)} = \frac{\left( \min\{x_{(1)}, y_{(1)}\} \right)^{m+n}}{x_{(1)}^m\, y_{(1)}^n}. \]

Now

\[ \lambda(x, y) = \begin{cases} \left( \dfrac{x_{(1)}}{y_{(1)}} \right)^n, & \text{if } \min\{x_{(1)}, y_{(1)}\} = x_{(1)} \\[2ex] \left( \dfrac{y_{(1)}}{x_{(1)}} \right)^m, & \text{if } \min\{x_{(1)}, y_{(1)}\} = y_{(1)}. \end{cases} \]


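So the critical region is

\[ \lambda(x, y) < \lambda \iff \left\{ \left( \frac{x_{(1)}}{y_{(1)}} \right)^n < \lambda \ \text{ or } \ \left( \frac{y_{(1)}}{x_{(1)}} \right)^m < \lambda \right\}, \quad 0 < \lambda < 1, \]

\[ \text{i.e. } \left\{ \frac{x_{(1)}}{y_{(1)}} < \lambda^{1/n} \ \text{ or } \ \frac{x_{(1)}}{y_{(1)}} > \lambda^{-1/m} \right\}, \]

where $\lambda$ is such that $P_H\left[ \frac{X_{(1)}}{Y_{(1)}} < \lambda^{1/n} \right] + P_H\left[ \frac{X_{(1)}}{Y_{(1)}} > \lambda^{-1/m} \right] = \alpha$ (the given significance level).

Now under $H$ the pdf of $\dfrac{X_{(1)}}{Y_{(1)}}$ at $u$ will be

\[ g(u) = \begin{cases} \dfrac{mn}{m+n}\, u^{n-1}, & 0 < u < 1 \\[2ex] \dfrac{mn}{m+n}\, \dfrac{1}{u^{m+1}}, & u > 1 \end{cases} \quad (\text{left to the readers as an exercise}). \]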


So

\[ P_H\left[ \frac{X_{(1)}}{Y_{(1)}} < \lambda^{1/n} \right] = \int_0^{\lambda^{1/n}} \frac{mn}{m+n}\, u^{n-1}\, du = \frac{m\lambda}{m+n} \]

and

\[ P_H\left[ \frac{X_{(1)}}{Y_{(1)}} > \lambda^{-1/m} \right] = \int_{\lambda^{-1/m}}^{\infty} \frac{mn}{m+n}\, \frac{1}{u^{m+1}}\, du = \frac{n\lambda}{m+n}. \]

Thus we have from the size condition that

\[ \frac{m\lambda}{m+n} + \frac{n\lambda}{m+n} = \lambda = \alpha. \]

$\implies$ the size-$\alpha$ CR of the LRT is $\left\{ \dfrac{x_{(1)}}{y_{(1)}} < \alpha^{1/n} \ \text{ or } \ \dfrac{x_{(1)}}{y_{(1)}} > \alpha^{-1/m} \right\}$.
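The size calculation above can be checked by a small Monte Carlo sketch in R. The sample sizes, the common value `theta` and the number of replications are arbitrary illustrative choices; sampling uses the inverse-CDF fact that if $U \sim \text{Unif}(0,1)$ then $\theta/U$ has pdf $\theta/x^2$ on $x \ge \theta$.

```r
# Monte Carlo check of the size of the LRT in Application 2 (illustrative values)
set.seed(3)
m <- 8; n <- 5          # hypothetical sample sizes
theta <- 2              # common parameter value under H (hypothetical)
alpha <- 0.05
n_sim <- 20000          # number of simulated data sets

reject <- replicate(n_sim, {
  x1 <- min(theta / runif(m))   # X_(1): smallest of m draws with pdf theta / x^2
  y1 <- min(theta / runif(n))   # Y_(1): smallest of n draws with pdf theta / y^2
  u  <- x1 / y1
  (u < alpha^(1 / n)) || (u > alpha^(-1 / m))   # size-alpha critical region
})

size_hat <- mean(reject)   # empirical rejection rate under H
size_hat
```

If the derivation is right, `size_hat` should land close to `alpha`, whatever value of `theta` is used.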


Application 3 : Application on bivariate normal population :

$(X_1, Y_1), \ldots, (X_n, Y_n) \overset{iid}{\sim} BVN(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$

Parameter $\theta = (\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$

Null hypothesis $H : \rho = 0$ versus (a) $K_1 : \rho \ne 0$ and (b) $K_2 : \rho > 0$ for testing.

$\implies \Omega_H = \{\theta : \mu_1, \mu_2 \in \mathbb{R},\ \sigma_1, \sigma_2 > 0,\ \rho = 0\}$

$\Omega_H \cup \Omega_{K_1} = \{\theta : \mu_1, \mu_2 \in \mathbb{R},\ \sigma_1, \sigma_2 > 0,\ |\rho| < 1\}$ : the unrestricted parameter space.


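Under $\Omega_H \cup \Omega_{K_1}$ the likelihood is

\[ L(\theta) = \frac{1}{\left( 2\pi\sigma_1\sigma_2\sqrt{1-\rho^2} \right)^n} \exp\left\{ -\frac12 Q(\theta) \right\}, \]

where

\[ Q(\theta) = \frac{1}{1-\rho^2} \sum_{i=1}^n \left[ \frac{(x_i - \mu_1)^2}{\sigma_1^2} + \frac{(y_i - \mu_2)^2}{\sigma_2^2} - \frac{2\rho (x_i - \mu_1)(y_i - \mu_2)}{\sigma_1\sigma_2} \right] \]

\[ = \frac{n}{1-\rho^2} \left[ \frac{s_1^2 + (\bar x - \mu_1)^2}{\sigma_1^2} + \frac{s_2^2 + (\bar y - \mu_2)^2}{\sigma_2^2} - \frac{2\rho \{ r s_1 s_2 + (\bar x - \mu_1)(\bar y - \mu_2) \}}{\sigma_1\sigma_2} \right], \]

where $r$ = sample correlation coefficient on $(X, Y)$.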


Under the unrestricted case, we know from Module 5 that the MLE of $(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$ is $(\bar x, \bar y, s_1^2, s_2^2, r)$.

\[ \implies \max_{\theta \in \Omega_H \cup \Omega_{K_1}} L(\theta) = \frac{e^{-n}}{(2\pi)^n \left( s_1^2 s_2^2 (1 - r^2) \right)^{n/2}}. \]

Also under $\rho = 0$, $X$ and $Y$ are independently distributed $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$ variables. $\implies$ MLE of $(\mu_1, \sigma_1^2) = (\bar x, s_1^2)$ and MLE of $(\mu_2, \sigma_2^2) = (\bar y, s_2^2)$.

\[ \implies \max_{\theta \in \Omega_H} L(\theta) = \frac{e^{-n}}{(2\pi)^n \left( s_1^2 s_2^2 \right)^{n/2}}. \]


\[ \implies \lambda(x) = \frac{\max_{\theta \in \Omega_H} L(\theta)}{\max_{\theta \in \Omega_H \cup \Omega_{K_1}} L(\theta)} = (1 - r^2)^{n/2}. \]

Now $\lambda(x) < c \iff |r| > c_1$ and $\sqrt{1 - r^2} < c_2$

\[ \iff \frac{\sqrt{n-2}\,|r|}{\sqrt{1 - r^2}} > \lambda. \]

From the sampling distribution we know that, under $H$,

\[ t = \frac{\sqrt{n-2}\, r}{\sqrt{1 - r^2}} \sim t_{n-2}. \]

$\implies \lambda = t_{\alpha/2;\,n-2}$ at level of significance $\alpha$.


(b) Under $\Omega_H \cup \Omega_{K_2}$ : $\rho \ge 0$, the restricted parameter space of $\rho$,

\[ \text{MLE of } \rho = \begin{cases} r, & r \ge 0 \\ 0, & r < 0. \end{cases} \]

\[ \implies \max_{\theta \in \Omega_H \cup \Omega_{K_2}} L(\theta) = \begin{cases} \dfrac{e^{-n}}{(2\pi)^n \left( s_1^2 s_2^2 (1 - r^2) \right)^{n/2}}, & r \ge 0 \\[2ex] \dfrac{e^{-n}}{(2\pi)^n \left( s_1^2 s_2^2 \right)^{n/2}}, & r < 0. \end{cases} \]

\[ \implies \lambda(x) = \begin{cases} (1 - r^2)^{n/2}, & r \ge 0 \\ 1, & r < 0 \quad (\text{case of trivial acceptance of } H). \end{cases} \]


$\implies$ CR: $\lambda(x) < c \subset \{r > 0\}$.

Also $\lambda(x) < c \iff 1 - r^2 < c_1 \iff r^2 > c_2$

$\iff r > c_3$ (as $r > 0$).

So the LRT CR: $\lambda(x) < c \iff \dfrac{\sqrt{n-2}\, r}{\sqrt{1 - r^2}} > \lambda$.

Level of significance $= \alpha \implies \lambda = t_{\alpha;\,n-2}$.
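As a numerical illustration of Application 3, the following R sketch computes $r$, the LR criterion $(1 - r^2)^{n/2}$ and the equivalent $t$ statistic for both alternatives. The paired data are simulated (hypothetical), and $t_{\alpha;\,n-2}$ is taken to be the upper $100\alpha\%$ point.

```r
# Hypothetical bivariate sample
set.seed(4)
n <- 25
x <- rnorm(n)
y <- 0.4 * x + rnorm(n)
alpha <- 0.05

r <- cor(x, y)                              # sample correlation coefficient
lambda_LR <- (1 - r^2)^(n / 2)              # LR criterion for H: rho = 0 vs rho != 0
t_H <- sqrt(n - 2) * r / sqrt(1 - r^2)      # equivalent statistic, t_{n-2} under H

# (a) two-sided: reject if |t_H| exceeds the upper alpha/2 point
reject_a <- abs(t_H) > qt(1 - alpha / 2, df = n - 2)
# (b) one-sided (rho > 0): reject if t_H exceeds the upper alpha point
reject_b <- t_H > qt(1 - alpha, df = n - 2)

c(r = r, lambda_LR = lambda_LR, t_H = t_H, reject_a = reject_a, reject_b = reject_b)
```

The value `t_H` coincides with the statistic reported by `cor.test(x, y)`, which can serve as an independent check.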


Application 4 :

$(X_1, Y_1), \ldots, (X_n, Y_n) \overset{iid}{\sim} BVN(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$

Parameter $\theta = (\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$

Let the null hypothesis be $H : \dfrac{\sigma_1^2}{\sigma_2^2} = \xi_0^2$ (known) versus (a) $K_1 : \dfrac{\sigma_1^2}{\sigma_2^2} \ne \xi_0^2$ and (b) $K_2 : \dfrac{\sigma_1^2}{\sigma_2^2} > \xi_0^2$ for testing.

Let us make linear transformations like $U_i = X_i + \xi_0 Y_i$ and $V_i = X_i - \xi_0 Y_i$, $i = 1, 2, \ldots, n$.

$\implies (U_1, V_1), \ldots, (U_n, V_n)$ also follow iid BVN with population correlation $\rho_{UV}$, say.


Obviously $Cov(U, V) = \sigma_1^2 - \xi_0^2 \sigma_2^2$. $\implies Cov(U, V) = 0$ under $H$.
As a result $\rho_{UV} = 0$ under $H$. Hence in terms of the transformed variables $(U_i, V_i)$ our testing problem boils down to testing

$H : \rho_{UV} = 0$ versus (a) $K_1 : \rho_{UV} \ne 0$, (b) $K_2 : \rho_{UV} > 0$.

Now proceeding exactly in the same way as in the earlier application, the size-$\alpha$ LR critical region for $H$ versus $K_1$ is

\[ |t_H| > t_{\alpha/2;\,n-2} \]

and the same for $H$ versus $K_2$ is $t_H > t_{\alpha;\,n-2}$, where $t_H = \dfrac{r_{UV}\sqrt{n-2}}{\sqrt{1 - r_{UV}^2}}$ follows the $t_{n-2}$ distribution under $H$ and $r_{UV}$ is the sample correlation coefficient on $(U, V)$ based on $n$ observations.
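The transformation trick of Application 4 is easy to try out in R. In this sketch the data and the hypothesised ratio `xi0` are hypothetical, and the symbol $\xi_0$ is used for the known constant in $H$; the last line of the block verifies the sample analogue of $Cov(U, V) = \sigma_1^2 - \xi_0^2 \sigma_2^2$.

```r
# Hypothetical paired data and hypothesised standard-deviation ratio xi0
set.seed(5)
n <- 30
xi0 <- 1.5
y <- rnorm(n, mean = 0, sd = 1)
x <- 0.3 * y + rnorm(n, mean = 0, sd = 1.4)
alpha <- 0.05

# Transformed variables: uncorrelated under H: sigma1^2 / sigma2^2 = xi0^2
u <- x + xi0 * y
v <- x - xi0 * y

r_uv <- cor(u, v)
t_H  <- r_uv * sqrt(n - 2) / sqrt(1 - r_uv^2)   # t_{n-2} under H

reject_K1 <- abs(t_H) > qt(1 - alpha / 2, df = n - 2)   # two-sided alternative
reject_K2 <- t_H > qt(1 - alpha, df = n - 2)            # one-sided alternative

# Sample version of Cov(U, V) = Var(X) - xi0^2 * Var(Y)
cov_identity_gap <- cov(u, v) - (var(x) - xi0^2 * var(y))
c(t_H = t_H, reject_K1 = reject_K1, reject_K2 = reject_K2, gap = cov_identity_gap)
```

The cross terms $\pm \xi_0\, Cov(X, Y)$ cancel, which is why the test does not require knowledge of $\rho$.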


Application 5 : Let $X \sim N(\mu_1, 1)$ and $Y \sim \text{Cauchy}(\mu_2, 1)$ independently.
To test $H : \mu_1 = \mu_2 = 0$ versus $K$ : not $H$. Here the likelihood is

\[ L(\mu_1, \mu_2) = \frac{1}{\sqrt{2\pi}} \exp\left\{ -\frac12 (x - \mu_1)^2 \right\} \cdot \frac{1}{\pi} \cdot \frac{1}{1 + (y - \mu_2)^2} \]

and hence the loglikelihood is

\[ \ell(\mu_1, \mu_2) = \text{Const} - \frac12 (x - \mu_1)^2 - \ln\{1 + (y - \mu_2)^2\}. \]

Now it would not be difficult to show that the ML estimate of $(\mu_1, \mu_2)$ is $(x, y)$, which maximises $L$.

\[ \implies \sup_{\Omega_H \cup \Omega_K} L(\mu_1, \mu_2) = \frac{1}{\pi\sqrt{2\pi}}. \]

Also $\sup_{\Omega_H} L(\mu_1, \mu_2) = \dfrac{1}{\pi\sqrt{2\pi}} \cdot \dfrac{e^{-x^2/2}}{1 + y^2}$.


$\implies$ the LR criterion is $\lambda(x, y) = \dfrac{e^{-x^2/2}}{1 + y^2}$.

$\implies$ the CR is $\lambda(x, y) < \lambda$, where $\lambda$ is such that

\[ P_H[\lambda(X, Y) < \lambda] = \alpha \quad (\text{given significance level}) \quad \ldots (*) \]

where $X \sim N(0, 1)$ and $Y \sim \text{Cauchy}(0, 1)$ independently under $H$.

Note. Since no explicit solution for $\lambda$ is possible from the size condition (*), we can use a simulation technique through an R program to get an empirical solution for $\lambda$ for a given value of $\alpha$, e.g. $\alpha = 0.05$ or $0.1$.


Computation using R : (to find a suitable $\lambda$ for $\alpha = 0.05$)
R Code and Output :

> library(varhandle)            # provides unfactor()
> lambda = NULL
> for (j in 1:1000)
  {
    # draw 10000 pairs (X, Y) under H and evaluate the LR criterion
    x = rnorm(10000, 0, 1)
    y = rcauchy(10000, 0, 1)
    W = exp(-(x*x)/2) / (1 + (y*y))
    W1 = sort(W)
    mat = as.data.frame(table(W1))
    # cumulative frequencies of the sorted values
    CF = NULL
    sum = 0
    for (i in 1:length(mat[,1]))
    {
      sum = sum + mat[,2][i]
      CF[i] = sum
    }
    mat1 = cbind(mat, CF)
    # value whose cumulative frequency is 500, i.e. the empirical 5% point of W
    lambda[j] = unfactor(mat1[,1][which(mat1[,3] == 500)])
    x = NULL
    y = NULL
    W = NULL
  }
> # usable lambda/cut-off value
> mean(lambda)
[1] 0.003076692
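For comparison, here is a more compact sketch of the same simulation idea that avoids the frequency table and the extra package: since the simulated values of the LR criterion are almost surely distinct, the value with cumulative frequency 500 is simply the 500th order statistic. The number of replications (`n_rep`) is reduced here for speed, and the seed is arbitrary; both are illustrative choices, not the settings behind the output quoted above.

```r
# Empirical cut-off lambda for Application 5: 5% point of exp(-X^2/2) / (1 + Y^2) under H
set.seed(6)
alpha  <- 0.05
n_rep  <- 200      # number of repetitions (illustrative; the transcript above uses 1000)
n_draw <- 10000    # simulated (X, Y) pairs per repetition

lambda_hat <- replicate(n_rep, {
  x <- rnorm(n_draw, 0, 1)
  y <- rcauchy(n_draw, 0, 1)
  w <- exp(-x^2 / 2) / (1 + y^2)          # LR criterion evaluated under H
  sort(w)[round(alpha * n_draw)]          # 500th smallest value = empirical 5% point
})

mean(lambda_hat)   # usable cut-off value for the critical region lambda(x, y) < lambda
```

Averaging over repetitions smooths out the sampling noise in each individual empirical quantile.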




LRT : Applications under k (> 2) comparable single 
distributions and Multivariate Distribution 
Module 13 


Saurav De 


Department of Statistics 
Presidency University 


Application 1.
• Consider $k\ (\ge 2)$ independent normal populations $N(\mu_1, \sigma_1^2), \ldots, N(\mu_k, \sigma_k^2)$.
• Let $x_{i1}, \ldots, x_{in_i}$ be a random sample of size $n_i$ drawn from $N(\mu_i, \sigma_i^2)$, $i = 1, \ldots, k$.
• $\sum_{i=1}^k n_i = n$ is the total sample size.
• Suppose we want to test $H : \mu_1 = \cdots = \mu_k = \mu$ (say) versus the alternative $K$ : not $H$.


• To avoid Behrens-Fisher type problems, we have to assume $\sigma_1^2 = \cdots = \sigma_k^2 = \sigma^2$ (say) (assumption of homoscedasticity).
• Under the homoscedasticity assumption, $\theta = (\mu_1, \ldots, \mu_k, \sigma^2)$ and

$\Omega_H = \{(\mu_1, \ldots, \mu_k, \sigma^2) : \mu_i = \mu,\ -\infty < \mu < \infty;\ \sigma > 0\}$ and
$\Omega_H \cup \Omega_K = \{(\mu_1, \ldots, \mu_k, \sigma^2) : -\infty < \mu_i < \infty,\ i = 1, \ldots, k;\ \sigma > 0\}$.


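• Now the likelihood of $\theta$ under $\Omega_H \cup \Omega_K$ is :

\[ L(\theta) = \frac{1}{(\sqrt{2\pi}\,\sigma)^n} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^k \sum_{j=1}^{n_i} (x_{ij} - \mu_i)^2 \right\}. \]

• With the usual routine derivation, we get the MLE of $\mu_i$ as $\hat\mu_i = \bar x_i = \frac{1}{n_i} \sum_{j=1}^{n_i} x_{ij}$ and the MLE of $\sigma^2$ as $\hat\sigma^2 = \frac1n \sum_{i=1}^k \sum_{j=1}^{n_i} (x_{ij} - \bar x_i)^2 = \frac{SSW}{n}$.
• Similarly, under $\Omega_H$,

\[ L(\theta) = \frac{1}{(\sqrt{2\pi}\,\sigma)^n} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^k \sum_{j=1}^{n_i} (x_{ij} - \mu)^2 \right\}. \]

• It can be shown that under $\Omega_H$ the MLE of $\mu$ is $\hat\mu = \bar x = \frac1n \sum_{i=1}^k \sum_{j=1}^{n_i} x_{ij}$ and the MLE of $\sigma^2$ is

\[ \hat{\hat\sigma}^2 = \frac1n \sum_{i=1}^k \sum_{j=1}^{n_i} (x_{ij} - \bar x)^2 = \frac{TSS}{n}. \]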


Hence,

\[ \max_{\theta \in \Omega_H \cup \Omega_K} L(\theta) = \frac{1}{(\sqrt{2\pi})^n (\hat\sigma^2)^{n/2}}\, e^{-n/2} \quad \text{and} \quad \max_{\theta \in \Omega_H} L(\theta) = \frac{1}{(\sqrt{2\pi})^n (\hat{\hat\sigma}^2)^{n/2}}\, e^{-n/2}. \]

$\implies$ The LR statistic :

\[ \lambda(x) = \frac{\max_{\theta \in \Omega_H} L(\theta)}{\max_{\theta \in \Omega_H \cup \Omega_K} L(\theta)} = \left( \frac{\hat\sigma^2}{\hat{\hat\sigma}^2} \right)^{n/2}. \]


Here, CR: $\lambda(x) < C \iff \dfrac{SSW}{TSS} < C_1$

\[ \iff \frac{\sum_i \sum_j (x_{ij} - \bar x_i)^2 + \sum_i n_i (\bar x_i - \bar x)^2}{\sum_i \sum_j (x_{ij} - \bar x_i)^2} > \frac{1}{C_1} \]

\[ \iff \frac{SSB}{SSW} > C_2 \quad (\text{on simplification}) \]

\[ \iff \frac{SSB/(k-1)}{SSW/(n-k)} > C_3, \]

where $SSB = \sum_{i=1}^k n_i (\bar x_i - \bar x)^2$ and $TSS = SSW + SSB$.


Actually the problem that we have considered here is the ANALYSIS OF VARIANCE problem in its simplest form, with one-way classified data under the fixed effects model.

$\implies$ the ultimate statistic that we get here is :

\[ F_H = \frac{SSB/(k-1)}{SSW/(n-k)} = \frac{MSB}{MSW} \sim F_{k-1,\,n-k} \text{ under } H. \]

We reject $H$ if and only if $F_H > C_3$, i.e. if and only if the observed value of $F_H$ is significantly large.
From the size-$\alpha$ condition, we get : $C_3 = F_{\alpha;\,k-1,\,n-k}$.
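A by-hand computation in R makes the link between the LR criterion and the ANOVA $F$ statistic concrete. The grouped data below are simulated (hypothetical group sizes and means); `SSW`, `SSB` and `TSS` follow the definitions used in this application.

```r
# Hypothetical one-way classified data: k = 3 groups of unequal size
set.seed(7)
g <- factor(rep(c("A", "B", "C"), times = c(6, 8, 7)))
x <- rnorm(length(g), mean = c(10, 11, 12)[as.integer(g)], sd = 2)
n <- length(x); k <- nlevels(g)
alpha <- 0.05

grand_mean  <- mean(x)
group_means <- tapply(x, g, mean)              # x-bar_i
n_i         <- tapply(x, g, length)            # n_i

SSW <- sum((x - ave(x, g))^2)                  # within-group sum of squares
SSB <- sum(n_i * (group_means - grand_mean)^2) # between-group sum of squares
TSS <- sum((x - grand_mean)^2)                 # total sum of squares

lambda_LR <- (SSW / TSS)^(n / 2)               # LR statistic
F_H <- (SSB / (k - 1)) / (SSW / (n - k))       # equivalent F statistic
C3  <- qf(1 - alpha, k - 1, n - k)             # upper alpha point of F_{k-1, n-k}

c(lambda_LR = lambda_LR, F_H = F_H, C3 = C3, reject = F_H > C3)
```

The value of `F_H` should agree with the F value in `anova(lm(x ~ g))`, and small `lambda_LR` corresponds exactly to large `F_H`.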


Application 2.
• Next, suppose we want to test $H : \sigma_1^2 = \cdots = \sigma_k^2 = \sigma^2$ (say) versus $K$ : not $H$. Now $\theta = (\mu_1, \ldots, \mu_k, \sigma_1, \ldots, \sigma_k)$.
• $\implies \Omega_H = \{\theta : -\infty < \mu_i < \infty,\ i = 1, \ldots, k;\ \sigma_1 = \cdots = \sigma_k = \sigma > 0\}$
• and $\Omega_H \cup \Omega_K = \{\theta : -\infty < \mu_i < \infty;\ \sigma_i^2 > 0\ \forall i = 1, \ldots, k\}$.
• Under $\Omega_H \cup \Omega_K$,

\[ L(\theta) = \prod_{i=1}^k \frac{1}{(\sqrt{2\pi}\,\sigma_i)^{n_i}} \exp\left\{ -\frac12 \sum_{i=1}^k \sum_{j=1}^{n_i} \frac{(x_{ij} - \mu_i)^2}{\sigma_i^2} \right\}. \]


Here, we can easily derive the MLE of $\mu_i$ to be $\bar x_i$ and the MLE of $\sigma_i^2$ as $s_i^2 = \frac{1}{n_i} \sum_{j=1}^{n_i} (x_{ij} - \bar x_i)^2$, the sample variance with divisor $n_i$ for the $i$th population, $i = 1, \ldots, k$.

\[ \implies \sup_{\theta \in \Omega_H \cup \Omega_K} L(\theta) = \frac{1}{(\sqrt{2\pi})^n \prod_{i=1}^k (s_i^2)^{n_i/2}}\, e^{-n/2}. \]

Again, under $\Omega_H$,

\[ L(\theta) = \frac{1}{(\sqrt{2\pi}\,\sigma)^n} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^k \sum_{j=1}^{n_i} (x_{ij} - \mu_i)^2 \right\}. \]


Here, the MLE of $\mu_i$ is as usual $\bar x_i$, but now the MLE of $\sigma^2$ is $\hat\sigma^2 = \frac1n \sum_{i=1}^k n_i s_i^2$, a weighted average of the sample variances, the weights being proportional to the sample sizes.

\[ \implies \sup_{\theta \in \Omega_H} L(\theta) = \frac{1}{(\sqrt{2\pi})^n (\hat\sigma^2)^{n/2}}\, e^{-n/2}. \]

$\implies$ the LR statistic :

\[ \lambda(x) = \frac{\prod_{i=1}^k (s_i^2)^{n_i/2}}{\left( \frac1n \sum_{i=1}^k n_i s_i^2 \right)^{n/2}}. \]

The Critical Region : $\lambda(x) < C$

\[ \iff \frac{\left[ \prod_{i=1}^k (s_i^2)^{n_i} \right]^{1/n}}{\frac1n \sum_{i=1}^k n_i s_i^2} < C_1 \]

\[ \iff \frac{\text{weighted GM of } s_i^2}{\text{weighted AM of } s_i^2} < C_1, \text{ weights being the sample sizes.} \]


• Since an unbiased estimator of $\sigma_i^2$ is $s_i'^2 = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} (x_{ij} - \bar x_i)^2$, the sample variance with divisor $(n_i - 1)$ for the $i$th population, replacing $s_i^2$ by $s_i'^2$, the CR can alternatively be written as :

\[ \frac{\text{weighted GM of } s_i'^2}{\text{weighted AM of } s_i'^2} < C, \text{ weights being } \nu_i = n_i - 1 \text{ and } \sum_i \nu_i = \nu \text{ (say) = total weight.} \]

• An improved LR statistic (in terms of $s_i'^2$) is :

\[ \lambda^* = \frac{\left[ (s_1'^2)^{\nu_1} \cdots (s_k'^2)^{\nu_k} \right]^{1/\nu}}{\frac{1}{\nu} \sum_{i=1}^k \nu_i s_i'^2}. \]


For computational purposes, it is better to use the statistic $M = -\nu \log_e(\lambda^*)$, which was suggested by Bartlett.

\[ \implies M = \nu \log_e\left\{ \frac{1}{\nu} \sum_{i=1}^k \nu_i s_i'^2 \right\} - \sum_{i=1}^k \nu_i \log_e s_i'^2. \]

Now, let us consider the following two approximations to the asymptotic null distribution of the test statistic.


• 1st approximation : For large $\nu_1, \ldots, \nu_k$, the asymptotic null distribution of $M$ is central $\chi^2$ with degrees of freedom $2k - (k+1) = k - 1$.

• 2nd approximation : If the $\nu_i$'s are not large, i.e. in the small sample case, Bartlett proved that the statistic

\[ M' = \frac{M}{1 + \dfrac{C_1}{3(k-1)}}, \quad \text{where } C_1 = \sum_{i=1}^k \frac{1}{\nu_i} - \frac{1}{\nu}, \]

has a null distribution tending to central $\chi^2_{k-1}$ at a faster rate than $M$ itself.
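Bartlett's statistic and its small-sample correction can be computed directly from the group variances. The following R sketch uses simulated (hypothetical) data; the names `M`, `C1` and `M_corr` mirror $M$, $C_1$ and $M'$ above, and the result can be compared with the built-in `bartlett.test()`, whose reported statistic is the corrected version.

```r
# Hypothetical data: k = 3 groups with possibly different variances
set.seed(8)
g <- factor(rep(1:3, times = c(7, 9, 6)))
x <- rnorm(length(g), mean = 0, sd = c(1, 1.5, 2)[as.integer(g)])
k <- nlevels(g)

nu_i <- as.numeric(tapply(x, g, length)) - 1   # nu_i = n_i - 1
s2_i <- as.numeric(tapply(x, g, var))          # unbiased variances s_i'^2
nu   <- sum(nu_i)

# M = nu * log(weighted AM of s_i'^2) - sum(nu_i * log(s_i'^2))
M  <- nu * log(sum(nu_i * s2_i) / nu) - sum(nu_i * log(s2_i))

# Bartlett's correction factor and corrected statistic
C1     <- sum(1 / nu_i) - 1 / nu
M_corr <- M / (1 + C1 / (3 * (k - 1)))

# Approximate p-values from the chi-square distribution with k - 1 df
p_uncorrected <- pchisq(M, df = k - 1, lower.tail = FALSE)
p_corrected   <- pchisq(M_corr, df = k - 1, lower.tail = FALSE)
c(M = M, M_corr = M_corr, p_uncorrected = p_uncorrected, p_corrected = p_corrected)
```

Since the correction factor exceeds 1, `M_corr` is always a little smaller than `M`, making the corrected test slightly more conservative.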


Application 3. Let $y_{ij} \sim N(\mu + \alpha_i + \beta_j, \sigma^2)$ independently for all $i = 1, \ldots, p$ and for all $j = 1, \ldots, q$. So the data are framed as two-way classified data, with a factor A having $p$ levels with additional effects $\alpha_i$'s and another factor B having $q$ levels with additional effects $\beta_j$'s.

Also assume $\sum_{i=1}^p \alpha_i = 0$ and $\sum_{j=1}^q \beta_j = 0$.

$\implies$ the parameter $\theta = (\mu, \alpha_i, \beta_j, \sigma^2;\ i = 1, \ldots, p,\ j = 1, \ldots, q)$.

Suppose we want to test $H_1 : \alpha_i = 0$ for all $i$ versus $K_1$ : not $H_1$, and $H_2 : \beta_j = 0$ for all $j$ versus $K_2$ : not $H_2$.
So $H_1 \iff$ factor A not significant and $H_2 \iff$ factor B not significant.


Under $\Omega_{H_1} \cup \Omega_{K_1}$ the loglikelihood of $\theta$ :

\[ \ell(\theta) = \text{Const.} - \frac n2 \ln \sigma^2 - \frac{1}{2\sigma^2} \sum_{i=1}^p \sum_{j=1}^q (y_{ij} - \mu - \alpha_i - \beta_j)^2; \quad n = pq. \]

So here the ML estimates are equivalent to the Least Square estimates. Hence the ML estimate of $(\mu, \alpha_i, \beta_j)$ is $(\bar y_{..},\ \bar y_{i.} - \bar y_{..},\ \bar y_{.j} - \bar y_{..})$, where

\[ \bar y_{..} = \frac{1}{pq} \sum_{i=1}^p \sum_{j=1}^q y_{ij} \ (\text{grand mean}), \quad \bar y_{i.} = \frac1q \sum_{j=1}^q y_{ij}, \quad \bar y_{.j} = \frac1p \sum_{i=1}^p y_{ij}. \]


Thus the ML estimate of $\sigma^2$ becomes

\[ \hat\sigma^2 = \frac1n \sum_{i=1}^p \sum_{j=1}^q \left\{ y_{ij} - \bar y_{..} - (\bar y_{i.} - \bar y_{..}) - (\bar y_{.j} - \bar y_{..}) \right\}^2 \]

which, on simplification, becomes

\[ \hat\sigma^2 = \frac1n \left[ \sum_{i=1}^p \sum_{j=1}^q (y_{ij} - \bar y_{..})^2 - q \sum_{i=1}^p (\bar y_{i.} - \bar y_{..})^2 - p \sum_{j=1}^q (\bar y_{.j} - \bar y_{..})^2 \right] \]

\[ = \frac1n \{TSS - SSA - SSB\} = \frac1n SSE, \]

where TSS = Total Sum of Squares, SSA = Sum of Squares Due to Factor A, SSB = Sum of Squares Due to Factor B and SSE = Sum of Squares Due to Error.

Under $\Omega_{H_1}$ the loglikelihood of $\theta$ :

\[ \ell(\theta) = \text{Const.} - \frac n2 \ln \sigma^2 - \frac{1}{2\sigma^2} \sum_{i=1}^p \sum_{j=1}^q (y_{ij} - \mu - \beta_j)^2. \]

For the same reason, here also the ML estimates of $\mu$ and $\beta_j$ remain unchanged, and as a result the ML estimate of $\sigma^2$ now becomes

\[ \hat{\hat\sigma}^2 = \frac1n \sum_{i=1}^p \sum_{j=1}^q \left\{ y_{ij} - \bar y_{..} - (\bar y_{.j} - \bar y_{..}) \right\}^2 \]

\[ = \frac1n \left[ \sum_{i=1}^p \sum_{j=1}^q (y_{ij} - \bar y_{..})^2 - p \sum_{j=1}^q (\bar y_{.j} - \bar y_{..})^2 \right] \quad (\text{on simplification}) \]

\[ = \frac1n [TSS - SSB] = \frac1n \{SSE + SSA\}. \]


\[ \implies \sup_{\Omega_{H_1} \cup \Omega_{K_1}} L(\theta) = \frac{1}{(2\pi\hat\sigma^2)^{n/2}} \exp\{-n/2\}, \quad \text{and similarly} \quad \sup_{\Omega_{H_1}} L(\theta) = \frac{1}{(2\pi\hat{\hat\sigma}^2)^{n/2}} \exp\{-n/2\}. \]

$\implies$ the LR criterion is $\lambda(y) = \left( \dfrac{\hat\sigma^2}{\hat{\hat\sigma}^2} \right)^{n/2}$.

The critical region is : $\lambda(y) < \lambda \iff \dfrac{\hat\sigma^2}{\hat{\hat\sigma}^2} < \lambda_1 \iff \dfrac{SSE}{SSE + SSA} < \lambda_1$,

or $\dfrac{SSE + SSA}{SSE} > \lambda_2$, i.e. equivalent to $\dfrac{SSA}{SSE} > \lambda_3$.


Now, from orthogonal splitting (due to Cochran's Theorem) we know that, under $H_1$, $\dfrac{SSE}{\sigma^2}$ and $\dfrac{SSA}{\sigma^2}$ are independently distributed with $\dfrac{SSE}{\sigma^2} \sim \chi^2_{(p-1)(q-1)}$ and $\dfrac{SSA}{\sigma^2} \sim \chi^2_{p-1}$.

Hence the CR can equivalently be written as

\[ F_{H_1} = \frac{SSA/(p-1)}{SSE/\{(p-1)(q-1)\}} > c, \]

where $F_{H_1} \sim F_{p-1,\,(p-1)(q-1)}$ under $H_1$ and $c$ is such that $P_{H_1}[\text{CR}] = \gamma$ (given significance level).

$\implies c = F_{\gamma;\,p-1,\,(p-1)(q-1)}$ : the upper $100\gamma\%$ point of the $F_{p-1,\,(p-1)(q-1)}$ distribution.

Similarly we can perform the LRT for $H_2$ versus $K_2$.
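The sums of squares in this two-way layout are simple to compute from a $p \times q$ data matrix. The R sketch below uses a simulated matrix `ymat` with hypothetical row and column effects (one observation per cell, as in the model above) and forms both $F$ statistics.

```r
# Hypothetical two-way classified data: ymat[i, j] = y_ij, one observation per cell
set.seed(9)
p <- 4; q <- 5
alpha_eff <- c(-1, 0, 0.5, 0.5)        # hypothetical effects of factor A (sum to 0)
beta_eff  <- c(-0.4, -0.2, 0, 0.2, 0.4) # hypothetical effects of factor B (sum to 0)
ymat <- 10 + outer(alpha_eff, beta_eff, "+") + matrix(rnorm(p * q), p, q)
gamma_level <- 0.05                      # significance level (gamma in the text)

grand <- mean(ymat)
TSS <- sum((ymat - grand)^2)
SSA <- q * sum((rowMeans(ymat) - grand)^2)   # due to factor A
SSB <- p * sum((colMeans(ymat) - grand)^2)   # due to factor B
SSE <- TSS - SSA - SSB                       # error sum of squares

df_err <- (p - 1) * (q - 1)
F_H1 <- (SSA / (p - 1)) / (SSE / df_err)     # test of H1: all alpha_i = 0
F_H2 <- (SSB / (q - 1)) / (SSE / df_err)     # test of H2: all beta_j = 0

c(F_H1 = F_H1, crit_H1 = qf(1 - gamma_level, p - 1, df_err),
  F_H2 = F_H2, crit_H2 = qf(1 - gamma_level, q - 1, df_err))
```

Because the layout is balanced, these values match the F column of `anova(lm(y ~ A + B))` when the same data are arranged in long form.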


Application 4. Next, suppose $X_1, \ldots, X_n$ are i.i.d. $N_p(\mu, \Sigma)$. Here $\theta = (\mu, \Sigma)$.

Let us test $H : \mu = \mu_0$ (known) versus $K : \mu \ne \mu_0$.

$\implies \Omega_H = \{(\mu, \Sigma) : \mu = \mu_0,\ \Sigma \text{ is p.d.}\}$ and
$\Omega_H \cup \Omega_K = \{(\mu, \Sigma) : -\infty < \mu_i < \infty,\ \Sigma \text{ is p.d.}\}$.

Under $\Omega_H \cup \Omega_K$ the likelihood is :

\[ L(\theta) = \frac{1}{(2\pi)^{np/2} |\Sigma|^{n/2}} \exp\left\{ -\frac12 \sum_{\alpha=1}^n (x_\alpha - \mu)' \Sigma^{-1} (x_\alpha - \mu) \right\}. \]


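\[ \implies L(\theta) = \frac{1}{(\sqrt{2\pi})^{np} |\Sigma|^{n/2}} \exp\left\{ -\frac12 \left[ n(\bar x - \mu)' \Sigma^{-1} (\bar x - \mu) + \sum_{\alpha=1}^n (x_\alpha - \bar x)' \Sigma^{-1} (x_\alpha - \bar x) \right] \right\}, \]

where $\bar x = \frac1n \sum_{\alpha=1}^n x_\alpha$ : the sample mean vector.

Or,

\[ L(\theta) = \frac{1}{(\sqrt{2\pi})^{np} |\Sigma|^{n/2}} \exp\left\{ -\frac12 \left[ n(\bar x - \mu)' \Sigma^{-1} (\bar x - \mu) + n\, \text{tr}(\Sigma^{-1} V) \right] \right\}, \]

where $V = \frac1n \sum_{\alpha=1}^n (x_\alpha - \bar x)(x_\alpha - \bar x)'$ : the sample variance-covariance matrix with divisor $n$.

$\implies$ the MLE of $\mu$ is $\bar x$ and of $\Sigma$ is $V$.

\[ \implies \max_{\theta \in \Omega_H \cup \Omega_K} L(\theta) = \frac{e^{-np/2}}{(\sqrt{2\pi})^{np} |V|^{n/2}}. \]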


Similarly, under $\Omega_H$,

\[ L(\theta) = \frac{1}{(\sqrt{2\pi})^{np} |\Sigma|^{n/2}} \exp\left\{ -\frac12 \sum_{\alpha=1}^n (x_\alpha - \mu_0)' \Sigma^{-1} (x_\alpha - \mu_0) \right\} = \frac{1}{(\sqrt{2\pi})^{np} |\Sigma|^{n/2}} \exp\left\{ -\frac n2\, \text{tr}(\Sigma^{-1} V_0) \right\}, \]

where $V_0 = \frac1n \sum_{\alpha=1}^n (x_\alpha - \mu_0)(x_\alpha - \mu_0)'$.

\[ \implies \ell(\theta) = \log_e L(\theta) = \text{const.} + \frac n2 \log_e |\Sigma^{-1} V_0| - \frac n2\, \text{tr}(\Sigma^{-1} V_0) \]

\[ = \text{const.} + \frac n2 \sum_{i=1}^p \log_e(\nu_i) - \frac n2 \sum_{i=1}^p \nu_i = \psi(\nu) \text{ (say)}, \]

where the $\nu_i$'s are the eigenvalues of $\Sigma^{-1} V_0$ and the constant is free of $\Sigma$.


\[ \frac{\partial \psi(\nu)}{\partial \nu_i} = \frac n2 \left( \frac{1}{\nu_i} - 1 \right) = 0 \implies \nu_i = 1\ \forall i. \]

$\implies \psi(\nu)$, and hence $\ell(\theta)$, gets maximised for $\nu_i = 1\ \forall i$, i.e. for $\Sigma^{-1} V_0 = I_p$.

$\implies$ under $\Omega_H$, the MLE of $\Sigma$ is $\hat\Sigma = V_0$.

\[ \implies \max_{\theta \in \Omega_H} L(\theta) = \frac{e^{-np/2}}{(\sqrt{2\pi})^{np} |V_0|^{n/2}} \quad \ldots (1) \]


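$\implies$ the LR statistic is :

\[ \lambda = \frac{\max_{\theta \in \Omega_H} L(\theta)}{\max_{\theta \in \Omega_H \cup \Omega_K} L(\theta)} = \left( \frac{|V|}{|V_0|} \right)^{n/2}. \]

The equivalent statistic $\lambda^{2/n} = \dfrac{|V|}{|V_0|}$ is called Wilks' Lambda.

CR: $\lambda < C \iff \dfrac{|V|}{|V_0|} < C_1$,

or $\dfrac{|V|}{|\hat\Sigma|} < C_1$, where $\hat\Sigma = V_0 = \frac1n \sum_{\alpha=1}^n (x_\alpha - \mu_0)(x_\alpha - \mu_0)'$.

Now,

\[ \sum_{\alpha=1}^n (x_\alpha - \mu_0)(x_\alpha - \mu_0)' = \sum_{\alpha=1}^n (x_\alpha - \bar x)(x_\alpha - \bar x)' + n(\bar x - \mu_0)(\bar x - \mu_0)'. \]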


Hence, $\hat\Sigma = V + (\bar x - \mu_0)(\bar x - \mu_0)'$.

Now, under $\Omega_H$,

$y = \sqrt n (\bar x - \mu_0) \sim N_p(0, \Sigma)$

and $S = \sum_{\alpha=1}^n (x_\alpha - \bar x)(x_\alpha - \bar x)' \sim \text{Wishart}_p(\Sigma, n-1)$ independently.

$\implies$ under $\Omega_H$, $T^2 = (n-1)\, y' S^{-1} y \sim T^2_{p,\,n-1}$ (central $T^2$ distribution with dimension $p$ and degrees of freedom $n - 1$).

Now,

\[ V = \frac{S}{n} \implies T^2 = \frac{n-1}{n}\, y' V^{-1} y. \]


Hence, $\hat\Sigma = V + \dfrac{y y'}{n}$. So the CR is : $\dfrac{|V|}{|\hat\Sigma|} < C_1 \iff \dfrac{|V|}{\left| V + \frac{y y'}{n} \right|} < C_1$.

As $\left| V + \dfrac{y y'}{n} \right| = |V| \left( 1 + \dfrac{y' V^{-1} y}{n} \right) = |V| \left( 1 + \dfrac{T^2}{n-1} \right)$, the CR is : $\dfrac{1}{1 + \frac{T^2}{n-1}} < C_1$

$\iff T^2 > C$, where from the size-$\alpha$ condition :

\[ C = T^2_{\alpha;\,p,\,n-1}, \]

where $T^2_{\alpha;\,p,\,n-1}$ is the upper $100\alpha\%$ point of the $T^2$ distribution with dimension $p$ and degrees of freedom $n - 1$.
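The relation between Wilks' Lambda and $T^2$ can be verified numerically. In the R sketch below the data matrix `X` (rows = observations) and `mu0` are hypothetical. The conversion of $T^2$ to an $F$ statistic, $\frac{n-p}{p(n-1)} T^2 \sim F_{p,\,n-p}$ under $H$, is a standard result that is not derived in the text and is used here only to obtain a critical value.

```r
# Hypothetical multivariate sample: n observations of dimension p
set.seed(10)
n <- 20; p <- 3
mu0 <- c(0, 0, 0)                          # hypothesised mean vector
X <- matrix(rnorm(n * p), n, p) + 0.3      # rows are observations
alpha <- 0.05

xbar <- colMeans(X)
V  <- crossprod(sweep(X, 2, xbar)) / n     # covariance matrix with divisor n
V0 <- crossprod(sweep(X, 2, mu0)) / n      # MLE of Sigma under H

# Hotelling's T^2 = (n - 1) (xbar - mu0)' V^{-1} (xbar - mu0)
d  <- xbar - mu0
T2 <- (n - 1) * drop(t(d) %*% solve(V) %*% d)

# Wilks' Lambda and its expression through T^2
wilks <- det(V) / det(V0)
wilks_from_T2 <- 1 / (1 + T2 / (n - 1))

# Standard F conversion (assumption: not derived in the text)
F_stat <- (n - p) / (p * (n - 1)) * T2
reject <- F_stat > qf(1 - alpha, p, n - p)

c(T2 = T2, wilks = wilks, wilks_from_T2 = wilks_from_T2, F_stat = F_stat, reject = reject)
```

The two Wilks values agree up to rounding, which is the determinant identity $|V + yy'/n| = |V|(1 + y'V^{-1}y/n)$ in action.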


TRY YOURSELF!

13.1. Let $y_{ij} \sim N(\mu + \alpha_i + \beta_j, \sigma^2)$ independently for all $i = 1, \ldots, p$ and for all $j = 1, \ldots, q$. So the data are framed as two-way classified data, with a factor A having $p$ levels with additional effects $\alpha_i$'s and another factor B having $q$ levels with additional effects $\beta_j$'s.

Also assume $\sum_{i=1}^p \alpha_i = 0$ and $\sum_{j=1}^q \beta_j = 0$.

If $\sigma$ is known, show that the LRT for ANOVA is performed by the $\chi^2$ test statistic.


13.2. Let $\{X_{11}, X_{21}, \ldots, X_{r1}\}, \ldots, \{X_{1c}, X_{2c}, \ldots, X_{rc}\}$ be independent multinomial random variables with parameters $(n_1; p_{11}, p_{21}, \ldots, p_{r1}), \ldots, (n_c; p_{1c}, p_{2c}, \ldots, p_{rc})$ respectively. Let $X_{i.} = \sum_{j=1}^c X_{ij}$ and $\sum_{j=1}^c n_j = n$. Show that the LRT for testing $H : p_{ij} = p_i\ \forall i = 1, \ldots, r$, for all $j = 1, 2, \ldots, c$, against all alternatives $K$ can be based on the statistic

\[ \lambda = \prod_{i=1}^r \prod_{j=1}^c \left( \frac{n_j X_{i.}}{n X_{ij}} \right)^{X_{ij}}. \]


TUTORIAL DISCUSSION :

Overview of the problems from MODULE 13...

M13.1. Since the basic set-up of the problem and the hypotheses to be tested are the same as in Application 3, we will use the same notation and formulae in this problem also.

The ML estimates remain the same as in Application 3, except for $\sigma$, which is known ($= \sigma_0$, say) here. So

\[ \sup_{\Omega_{H_1}} L(\theta) = \frac{1}{(2\pi\sigma_0^2)^{n/2}} \exp\left\{ -\frac{1}{2\sigma_0^2} \sum_{i=1}^p \sum_{j=1}^q \left( y_{ij} - \bar y_{..} - (\bar y_{.j} - \bar y_{..}) \right)^2 \right\} \]

\[ = \frac{1}{(2\pi\sigma_0^2)^{n/2}} \exp\left\{ -\frac{1}{2\sigma_0^2} (TSS - SSB) \right\}. \]

Similarly,

\[ \sup_{\Omega_{H_1} \cup \Omega_{K_1}} L(\theta) = \frac{1}{(2\pi\sigma_0^2)^{n/2}} \exp\left\{ -\frac{1}{2\sigma_0^2} \sum_{i=1}^p \sum_{j=1}^q \left( y_{ij} - \bar y_{..} - (\bar y_{i.} - \bar y_{..}) - (\bar y_{.j} - \bar y_{..}) \right)^2 \right\} \]

\[ = \frac{1}{(2\pi\sigma_0^2)^{n/2}} \exp\left\{ -\frac{TSS - SSA - SSB}{2\sigma_0^2} \right\}. \]

If $\lambda(y)$ denotes the LR criterion, then here

\[ \ln \lambda(y) = -\frac{1}{2\sigma_0^2} [TSS - SSB] + \frac{1}{2\sigma_0^2} [TSS - SSA - SSB] = -\frac{SSA}{2\sigma_0^2} \]

\[ \implies -2 \ln \lambda(y) = \frac{SSA}{\sigma_0^2} \sim \chi^2_{p-1} \text{ under } H_1. \]


Justification : From the sampling distribution we can write, under $H_1$,

$\bar y_{i.} \sim N\left( \mu, \dfrac{\sigma_0^2}{q} \right)$ independently $\forall i = 1, \ldots, p$.

Hence, again from the well-known result of the sampling distribution on a normal population, we get

\[ \frac{\sum_{i=1}^p (\bar y_{i.} - \bar y_{..})^2}{\sigma_0^2 / q} = \frac{q \sum_{i=1}^p (\bar y_{i.} - \bar y_{..})^2}{\sigma_0^2} \sim \chi^2_{p-1} \text{ under } H_1, \]

or $\dfrac{SSA}{\sigma_0^2} \sim \chi^2_{p-1}$ under $H_1$.

So the CR: $\lambda(y) < \lambda \iff -2 \ln \lambda(y) > c$, i.e. $\chi^2_{H_1} = \dfrac{SSA}{\sigma_0^2} > c$, where $c$ is determined as $\chi^2_{\gamma;\,p-1}$ from the size-$\gamma$ condition. Similarly, we perform the test of $H_2$ versus $K_2$.


M13.2. Without loss of generality we assume here a singular multinomial distribution. So we can use

$p_{1j} + p_{2j} + \cdots + p_{rj} = 1\ \forall j$ and, under $H$, $p_1 + p_2 + \cdots + p_r = 1$.

From the discussion on MLE in Module 5, we get that under $\Omega_H \cup \Omega_K$ the ML estimate of $p_{ij}$ is $\dfrac{x_{ij}}{n_j}$. Thus

\[ \sup_{\Omega_H \cup \Omega_K} L(\theta) = \prod_{j=1}^c \left[ \frac{n_j!}{\prod_{i=1}^r x_{ij}!} \prod_{i=1}^r \left( \frac{x_{ij}}{n_j} \right)^{x_{ij}} \right]. \]

On the other hand, the likelihood under $\Omega_H$ is

\[ L(\theta) = \prod_{j=1}^c \left[ \frac{n_j!}{\prod_{i=1}^r x_{ij}!}\, p_1^{x_{1j}} p_2^{x_{2j}} \cdots p_r^{x_{rj}} \right]; \quad 0 < p_i < 1. \]


Then, following the conventional way of finding the MLE under the multinomial probability model (discussed in Module 5), we get the MLE of $p_i$ as

\[ \hat p_i = \frac{x_{i.}}{n}. \quad (\text{Readers! Try yourselves}) \]

\[ \implies \sup_{\Omega_H} L(\theta) = \prod_{j=1}^c \left[ \frac{n_j!}{\prod_{i=1}^r x_{ij}!} \right] \prod_{i=1}^r \left( \frac{x_{i.}}{n} \right)^{x_{i.}}. \]

Thus finally the LR criterion will be

\[ \lambda = \frac{\prod_{i=1}^r \left( \frac{x_{i.}}{n} \right)^{x_{i.}}}{\prod_{j=1}^c \prod_{i=1}^r \left( \frac{x_{ij}}{n_j} \right)^{x_{ij}}}. \]




Some Asymptotically Equivalent Tests 
Module 14 


Saurav De 


Department of Statistics 
Presidency University 


Now let us consider the following tests having similar asymptotic behaviour.

Start with the following asymptotic aspects.

We know from the definition of the likelihood ratio criterion that

\[ \lambda(x) = \frac{L_x(\theta_0)}{L_x(\hat\theta_n)}, \]

where $x$ denotes data from random sampling, $\hat\theta_n$ : the MLE of $\theta$, $\theta_0$ : the hypothesised value of $\theta$.

$\implies -2 \ln \lambda(x) = 2\left[ \ell_x(\hat\theta_n) - \ell_x(\theta_0) \right]$, where $\ell$ denotes the loglikelihood.


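From Taylor's expansion we can write

\[ \ell_x(\theta_0) \approx \ell_x(\hat\theta_n) + (\theta_0 - \hat\theta_n)\, \ell_x'(\hat\theta_n) + \frac{(\theta_0 - \hat\theta_n)^2}{2}\, \ell_x''(\hat\theta_n). \]

Since $\ell_x'(\hat\theta_n) = 0$,

\[ \implies \ell_x(\theta_0) - \ell_x(\hat\theta_n) \approx \frac{(\hat\theta_n - \theta_0)^2}{2}\, \ell_x''(\hat\theta_n), \]

or

\[ -2 \ln \lambda(x) = 2\left[ \ell_x(\hat\theta_n) - \ell_x(\theta_0) \right] \approx \left\{ \sqrt n (\hat\theta_n - \theta_0) \right\}^2 \left( -\frac1n \ell_x''(\hat\theta_n) \right). \]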


Now use the fact that

\[ -\frac1n \ell_x''(\hat\theta_n) = -\frac1n \sum_{i=1}^n \frac{\partial^2}{\partial\theta^2} \log f_\theta(x_i) \Big|_{\theta = \hat\theta_n} \approx i(\hat\theta_n). \]

Hence we get

\[ Q_{LR} = -2 \ln \lambda(x) = 2\left[ \ell_x(\hat\theta_n) - \ell_x(\theta_0) \right] \approx \left\{ \sqrt n (\hat\theta_n - \theta_0) \right\}^2 i(\hat\theta_n). \]


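Again, expanding $\ell'$ with respect to $\theta_0$, we get from another Taylor's expansion

\[ \ell'(\hat\theta_n) \approx \ell'(\theta_0) + (\hat\theta_n - \theta_0)\, \ell''(\theta_0). \]

But $\ell'(\hat\theta_n) = 0$ as $\hat\theta_n$ : MLE of $\theta$.

\[ \implies (\hat\theta_n - \theta_0) \approx -\frac{\ell'(\theta_0)}{\ell''(\theta_0)}. \]

Or

\[ n(\hat\theta_n - \theta_0)^2\, i(\theta_0) \approx \frac{n\, (\ell'(\theta_0))^2}{(\ell''(\theta_0))^2}\, i(\theta_0) = \frac{(\ell'(\theta_0))^2}{n \left( -\frac1n \ell''(\theta_0) \right)^2}\, i(\theta_0) \approx \frac{(\ell'(\theta_0))^2}{n\, i^2(\theta_0)}\, i(\theta_0) = \frac{\left\{ \frac{1}{\sqrt n} \ell'(\theta_0) \right\}^2}{i(\theta_0)}. \]

Or we can write

\[ \left\{ \sqrt n (\hat\theta_n - \theta_0) \right\}^2 i(\hat\theta_n) \approx \frac{\left\{ \frac{1}{\sqrt n} \ell'(\theta_0) \right\}^2}{i(\theta_0)} \text{ under } \theta_0 \quad \ldots (*) \]

Alternative way: follow the approach of iterative MLE and take the initial approximation as $\theta_0$ itself.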


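• Hence we can alternatively say that

\[ Q_{LR} = -2 \ln \lambda(X) = 2\left[ \ell_x(\hat\theta_n) - \ell_x(\theta_0) \right] \approx \frac{\left\{ \frac{1}{\sqrt n} \ell'(\theta_0) \right\}^2}{i(\theta_0)} \text{ under } \theta_0. \]

• Also $\dfrac{1}{\sqrt n} \ell'(\theta_0) = \dfrac{1}{\sqrt n} \sum_{i=1}^n \dfrac{\partial}{\partial\theta} \log f_\theta(X_i) \Big|_{\theta_0} \xrightarrow{D} N(0, i(\theta_0)) \quad \ldots (**)$

[Justification : Define $Y_i = \dfrac{\partial}{\partial\theta} \log f_\theta(X_i) \Big|_{\theta_0}$. So the $Y_i$ for $i = 1, 2, \ldots$ are iid (as the $X_i$'s are iid) with $E(Y_1) = 0$; $V(Y_1) = i(\theta_0)$.

Hence by the CLT, $\dfrac{1}{\sqrt n} \sum_{i=1}^n Y_i = \sqrt n\, \bar Y \xrightarrow{D} N(0, i(\theta_0))$.]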


Saurav De (Department of Statistics Preside Some Asymptotically Equivalent Tests 7/29 


à Dathshala ‘ 
e 


Some Asymptotically Equivalent Tests @lgarserrenr 


Based on the above asymptotic equivalence and asymptotic normality, we 
have the following two popular tests. 


WALD'S TEST 


Single parameter ($\theta$) case (for real-valued $\theta$): Suppose the Fisher information $I(\theta)$ exists at all $\theta$. Then one should reject $H : \theta = \theta_0$ if the observed value of the statistic

$$Q_W = n(\hat\theta_n - \theta_0)^2\, i(\hat\theta_n)$$

is significantly large. This $Q_W$ is known as the Wald test statistic for testing $H : \theta = \theta_0$ in the single-parameter case.


Provided
• certain regularity conditions for the asymptotic normality of the MLE $\hat\theta_n$ hold, and
• $I(\theta)$ is continuous in a neighbourhood of $\theta_0$,
(*) and (**) directly imply that, under $H : \theta = \theta_0$,

$$Q_W \xrightarrow{D} \chi^2_1.$$

Hence for large $n$, $H : \theta = \theta_0$ will be rejected at level $\alpha$ if

$$Q_W = n(\hat\theta_n - \theta_0)^2\, i(\hat\theta_n) > \chi^2_{1,\alpha}.$$
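The single-parameter Wald rule can be sketched in R as follows. This is only an illustration under an assumed model that is not part of the lecture: an iid Poisson($\theta$) sample, for which $\hat\theta_n = \bar{x}$ and $i(\theta) = 1/\theta$. The function name `wald_test` and the data are hypothetical.

```r
# Wald test of H: theta = theta0, sketched for an iid Poisson(theta) sample.
# Assumption (not from the lecture): Poisson model, so MLE = mean(x), i(theta) = 1/theta.
wald_test <- function(x, theta0, alpha = 0.05) {
  n <- length(x)
  theta_hat <- mean(x)                          # MLE of theta
  info_hat <- 1 / theta_hat                     # i(theta) evaluated at the MLE
  Qw <- n * (theta_hat - theta0)^2 * info_hat   # Wald statistic
  crit <- qchisq(1 - alpha, df = 1)             # upper-alpha point of chi-square(1)
  list(statistic = Qw, critical = crit, reject = Qw > crit,
       p_value = pchisq(Qw, df = 1, lower.tail = FALSE))
}

x <- c(3, 5, 2, 4, 6, 3, 4, 5, 2, 6)   # hypothetical counts, mean = 4
res <- wald_test(x, theta0 = 3)
res$statistic                          # 10 * (4 - 3)^2 / 4 = 2.5
res$reject
```

With these hypothetical counts the statistic is 2.5, below the 5% critical value of about 3.84, so $H$ would not be rejected.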


Multi-parameter ($\theta$) case (for $\theta$ an $s$-component vector): Suppose $\hat\theta_n$ is the MLE of $\theta$ and the Fisher information matrix $I(\theta)$ exists for all $\theta$. Then the null hypothesis $H : \theta = \theta_0$ should be rejected if the value of the statistic

$$Q_W = \left\{\sqrt{n}(\hat\theta_n - \theta_0)\right\}'\, I(\hat\theta_n)\, \left\{\sqrt{n}(\hat\theta_n - \theta_0)\right\}$$

is significantly high.

Assuming $I(\theta)$ is smooth at $\theta_0$ and the regularity conditions for the asymptotic multivariate normality of $\hat\theta_n$ hold, under $H : \theta = \theta_0$

$$Q_W = \left\{\sqrt{n}(\hat\theta_n - \theta_0)\right\}'\, I(\hat\theta_n)\, \left\{\sqrt{n}(\hat\theta_n - \theta_0)\right\} \xrightarrow{D} \chi^2_s.$$

$\Rightarrow$ for large $n$, reject $H$ at level $\alpha$ if $Q_W > \chi^2_{s,\alpha}$.


RAO'S SCORE TEST 


Single parameter ($\theta$) case: Suppose $l_x(\theta)$ is differentiable with respect to $\theta$ for each $x$. Then $\frac{\partial}{\partial\theta}\, l_x(\theta) = l'_x(\theta)$ is called the score function of $\theta$.
Also suppose the Fisher information $I(\theta)$ exists and is positive at $\theta_0$.

Now if $\theta_0$ is the true value of $\theta$, it will be close to $\hat\theta_n$. As a result $l'_x(\theta_0)$ should be close to 0 (since $l'_x(\hat\theta_n) = 0$).

Thus a significantly high value of $\frac{1}{n}\frac{\{l'_x(\theta_0)\}^2}{I(\theta_0)}$ provides evidence against the null hypothesis $H : \theta = \theta_0$.

The statistic $Q_S = \frac{1}{n}\frac{\{l'_x(\theta_0)\}^2}{I(\theta_0)}$ is called the Rao score statistic in the real-valued parameter case.


Also, under $H : \theta = \theta_0$, the asymptotic distribution of $\frac{1}{n}\frac{\{l'_x(\theta_0)\}^2}{I(\theta_0)}$ is $\chi^2$ with 1 degree of freedom.

[Justification: $\frac{1}{\sqrt{n}}\,\frac{l'_x(\theta_0)}{\sqrt{I(\theta_0)}} \overset{a}{\approx} \sqrt{n}(\hat\theta_n - \theta_0)\sqrt{I(\theta_0)}$,

and $\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{D} N\left(0, \frac{1}{I(\theta_0)}\right)$,

i.e. $\sqrt{n}(\hat\theta_n - \theta_0)\sqrt{I(\theta_0)} \xrightarrow{D} N(0, 1)$ (using the asymptotic distribution of the univariate MLE). Hence justified.]


Thus for large $n$, with a real-valued parameter $\theta$, Rao's score test is:

• reject $H$ at level $\alpha$ if $Q_S = \frac{1}{n}\frac{\{l'_x(\theta_0)\}^2}{I(\theta_0)} > \chi^2_{1,\alpha}$.
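A minimal R sketch of the single-parameter score test, again assuming (for illustration only, not from the lecture) an iid Poisson($\theta$) sample, so that $l'_x(\theta_0) = \sum x_i/\theta_0 - n$ and $I(\theta_0) = 1/\theta_0$. The function name `score_test` and the data are hypothetical. Note that no MLE is computed.

```r
# Rao score test of H: theta = theta0, sketched for an iid Poisson(theta) sample.
# Assumption (not from the lecture): Poisson model with per-observation information 1/theta.
score_test <- function(x, theta0, alpha = 0.05) {
  n <- length(x)
  score <- sum(x) / theta0 - n       # l'_x(theta0) for the Poisson loglikelihood
  info0 <- 1 / theta0                # I(theta0), per observation
  Qs <- score^2 / (n * info0)        # (1/n) * score^2 / I(theta0)
  crit <- qchisq(1 - alpha, df = 1)
  list(statistic = Qs, critical = crit, reject = Qs > crit)
}

x <- c(3, 5, 2, 4, 6, 3, 4, 5, 2, 6)   # hypothetical counts, sum = 40
res_s <- score_test(x, theta0 = 3)
res_s$statistic                        # (40/3 - 10)^2 * 3 / 10 = 10/3
```

Only quantities evaluated at $\theta_0$ enter the statistic, which is the practical advantage of the score test.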


Multi-parameter ($\theta$) case: Let $\theta$ have $s\ (\ge 1)$ components. Suppose

• $l_x$ is partially differentiable under the integral sign with respect to each component $\theta_i$ of $\theta$ for every $x$,

• the Fisher information matrix $I(\theta)$ exists and is invertible at $\theta_0$, and
• the support of the parent distribution is independent of $\theta$.

Then under $H : \theta = \theta_0$ (using the asymptotic multivariate normality of the vector-valued $\hat\theta_n$)

$$Q_S = S'\, I^{-1}(\theta_0)\, S \xrightarrow{D} \chi^2_s,$$

where $S_i = \frac{1}{\sqrt{n}}\,\frac{\partial}{\partial\theta_i}\, l_x(\theta)\Big|_{\theta=\theta_0}$.


The vector S is called the score vector. 


Following the same idea as in the single-parameter case, for large $n$, reject $H$ at level $\alpha$ if

$$Q_S = S'\, I^{-1}(\theta_0)\, S > \chi^2_{s,\alpha}.$$

Rao's score test
• is also known as the Lagrange multiplier test in econometrics,
• is most often used for testing a simple null hypothesis $H : \theta = \theta_0$, and

• provides the most powerful test when the true value of $\theta$ lies in the close vicinity of $\theta_0$ (i.e. it is locally most powerful).


Why Rao's score test is most powerful in the locality of $\theta_0$:
For testing $\theta_0$ against a simple alternative $\theta_0 + h$, the Neyman-Pearson lemma provides the most powerful test, and the test can be expressed as

$$\frac{L(\theta_0 + h \mid x)}{L(\theta_0 \mid x)} \ge K \iff \log L(\theta_0 + h \mid x) - \log L(\theta_0 \mid x) \ge \log K \quad \ldots (*)$$

Also, from a Taylor series expansion we can write

$$\log L(\theta_0 + h \mid x) \approx \log L(\theta_0 \mid x) + h\,\frac{\partial}{\partial\theta}\log L(\theta \mid x)\Big|_{\theta=\theta_0} \quad \ldots (**)$$

[ignoring the higher order terms, as these are negligible if $h$ is very small, i.e. if $\theta = \theta_0 + h$ is close to $\theta_0$]


Hence from (*) and (**) we can say that

$$h\left(\frac{\partial}{\partial\theta}\log L(\theta \mid x)\right)_{\theta=\theta_0} \ge \log K$$

or, in other words, $\frac{1}{n}\frac{\{l'_x(\theta_0)\}^2}{I(\theta_0)} > c$, where $c$ is chosen from the size ($\alpha$) condition.
In particular, for large $n$, $c = \chi^2_{1,\alpha}$.


Note that, under simple null hypotheses,

• $Q_S$ requires fewer assumptions than $Q_W$ and $Q_{LR}$ for the $\chi^2$ approximation, and

• the $Q_S$ statistic does not require computation of the maximum likelihood estimate of $\theta$.

These features are advantageous because

• the test is applicable even when the unconstrained maximum likelihood estimate is a boundary point of the parameter space, and

• the null distribution of $Q_S$ can be approximated by $\chi^2$ in a wide range of applications.


Note.

• Like $Q_{LR}$, $Q_W$ and $Q_S$ can also be framed for testing composite nulls of the form $H : \psi(\theta) = (\psi_1(\theta), \ldots, \psi_r(\theta)) = 0$. The $Q_S$ statistic now depends on the restricted MLE of $\theta$ under $H$. However, in this case also, each of the three statistics retains an asymptotic $\chi^2_r$ distribution under $H$.

• In the statistics literature, the LR test, the Wald test and Rao's score test are referred to as the Holy Trinity due to their wide applicability.

• All these tests are asymptotically equivalent, at least to the first order of asymptotics, but

• may differ to some extent in their second (or higher) order properties.

As a result,

• no one of them is uniformly superior to the others.


Likelihood Ratio, Wald and Score Confidence Sets 


Using the usual duality (i.e. the equivalence) between the critical regions (CRs) of hypothesis tests and confidence sets, the acceptance region (the complement of the CR) of a test with size $\alpha$ can be inverted to give a confidence set with confidence coefficient at least $(1 - \alpha)$. This method is known as the inversion of a test.

Let $A(\theta_0)$ denote the acceptance region of a size-$\alpha$ test for testing $H : \theta = \theta_0$. Inverting the set $A(\theta_0)$ we get the set

$$S(x) = \{\theta_0 : x \in A(\theta_0)\} \ \Rightarrow\ P_{\theta_0}[S(X) \ni \theta_0] \ge 1 - \alpha,$$

i.e. in general $P_\theta[S(X) \ni \theta] \ge 1 - \alpha$.

$\Rightarrow$ $S(X)$ is a $100(1 - \alpha)\%$ confidence set for $\theta$.


The confidence sets obtained from the LRT, the Wald test and the score test are respectively known as the likelihood ratio, Wald and score confidence sets.

Note that

• Since the Wald and the score test statistics are defined as quadratic forms, the corresponding confidence sets will be ellipsoids.

• The LR confidence set does not have any such pre-specified form and is typically more complicated.

We can illustrate the construction of these confidence sets through the following example:

Example. Suppose $x_1, \ldots, x_n$ are $n$ realisations from a Bernoulli($\theta$) distribution. Let $H : \theta = \theta_0$ versus $K : \theta \ne \theta_0$.
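By the LRT, the LR criterion will be

$$\Lambda(x) = \frac{\theta_0^{x}(1 - \theta_0)^{n-x}}{\sup_\theta\, \theta^{x}(1 - \theta)^{n-x}} = \frac{\theta_0^{x}(1 - \theta_0)^{n-x}}{\hat\theta^{x}(1 - \hat\theta)^{n-x}}, \quad \text{where } x = \sum_{i=1}^{n} x_i \text{ and } \hat\theta = x/n.$$

Then the LR critical region will be $\{x : \Lambda(x) < \lambda\}$, where $\lambda$ is such that the size of the test is $\alpha$. Thus, inverting the acceptance region, we get the $100(1 - \alpha)\%$ exact confidence set

$$S_{LR}(x) = \left\{\theta : \frac{\theta^{x}(1 - \theta)^{n-x}}{\hat\theta^{x}(1 - \hat\theta)^{n-x}} \ge \lambda\right\} = \left\{\theta : \theta^{x}(1 - \theta)^{n-x} \ge k^*\right\}$$

$$= \{\theta : x\log\theta + (n - x)\log(1 - \theta) \ge k\}.$$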


As $\theta$ is real-valued here, $S_{LR}(x)$ will be an interval of the form $[0, u_1]$ or $[l_1, 1]$ or $[l_2, u_2]$, where $0 < l_1, l_2, u_1, u_2 < 1$.

Unfortunately no explicit closed form is available for the $l$'s and $u$'s. But using an R program, we can illustrate how a $100(1 - \alpha)\%$ exact confidence interval can be generated by simulation.

Let $n = 10$, $\theta_0 = 0.6$, $\alpha = 0.05$ and the realization be (1, 1, 0, 1, 1, 1, 1, 0, 1, 1).


Illustration using R (to find the confidence interval)
R Code and Output:

> library(varhandle)   # provides unfactor()
> n=10
> p=0.6
> lambda=NULL
> for(j in 1:1000)
+ {
+   x=rbinom(10000,n,p)
+   W=(x*log(p))+((n-x)*log(1-p))
+   W1=sort(W)
+   mat=as.data.frame(table(W1))
+   CF=NULL
+   sum=0
+   for(i in 1:length(mat[,1]))
+   {
+     sum=sum+mat[,2][i]
+     CF[i]=sum
+   }
+   mat1=cbind(mat,CF)
+   lambda[j]=unfactor(mat1[,1][which(mat1[,3]>=9500)])[1]
+   x=NULL
+ }
> View(lambda)
> # usable lambda/cut-off value
> k=mean(lambda)
> # now, to find limits of 'p'
> # realization
> x_star=1+1+0+1+1+1+1+0+1+1
> p1=seq(0.1,0.9,0.1)
> W2=(x_star*log(p1))+((n-x_star)*log(1-p1))
> check=ifelse(W2>=k,1,0)
> mat2=data.frame(p1,W2,check)
> suit=mat2$p1[which(mat2$check==1)]
> # limits of "p"
> lower=suit[1]
> upper=suit[length(suit)]
> lower ; upper
[the two printed limits are not legible in the source]
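As a cross-check on the simulation-based cut-off, the LR confidence set can also be sketched with the large-sample cut-off $-2\ln\Lambda \le \chi^2_{1,\alpha}$. This is a different technique from the exact simulation approach described in the lecture: it relies on the asymptotic $\chi^2_1$ approximation. The helper names `loglik` and `dev` are hypothetical, and the sketch assumes $0 < x < n$.

```r
# Large-sample LR confidence interval for a Bernoulli parameter (asymptotic cut-off,
# not the simulated cut-off of the lecture). Assumes 0 < x < n.
n <- 10; x <- 8; alpha <- 0.05
theta_hat <- x / n
loglik <- function(th) x * log(th) + (n - x) * log(1 - th)
dev <- function(th) 2 * (loglik(theta_hat) - loglik(th))   # -2 log Lambda at th
crit <- qchisq(1 - alpha, df = 1)

# The set {theta : dev(theta) <= crit} is an interval around theta_hat;
# its end-points are the two roots of dev(theta) = crit.
lower <- uniroot(function(th) dev(th) - crit, c(1e-6, theta_hat), tol = 1e-10)$root
upper <- uniroot(function(th) dev(th) - crit, c(theta_hat, 1 - 1e-6), tol = 1e-10)$root
c(lower, upper)
```

Root finding replaces the coarse grid `seq(0.1,0.9,0.1)`, so the limits are not restricted to multiples of 0.1.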


Here we will have $I(\theta) = \frac{1}{\theta(1 - \theta)}$.

Hence the Wald test statistic will be $Q_W = \frac{n(\hat\theta - \theta_0)^2}{\hat\theta(1 - \hat\theta)}$.
The size-$\alpha$ critical region from the large sample Wald test will be

$$\frac{n(\hat\theta - \theta_0)^2}{\hat\theta(1 - \hat\theta)} > \chi^2_{1,\alpha}.$$

So the $100(1 - \alpha)\%$ large sample Wald confidence interval of $\theta$ will be

$$S_W(x) = \left[\hat\theta - \sqrt{\chi^2_{1,\alpha}\,\hat\theta(1 - \hat\theta)/n},\ \ \hat\theta + \sqrt{\chi^2_{1,\alpha}\,\hat\theta(1 - \hat\theta)/n}\right],$$

which is the heuristic large sample confidence interval of $\theta$.
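The Wald interval can be evaluated in R for the realization used in the example ($n = 10$, $x = 8$, $\alpha = 0.05$). This is a minimal sketch; the variable names are illustrative only.

```r
# Large-sample Wald confidence interval for a Bernoulli parameter.
n <- 10; x <- 8; alpha <- 0.05
theta_hat <- x / n                                     # MLE
crit <- qchisq(1 - alpha, df = 1)                      # chi-square(1) upper-alpha point
half <- sqrt(crit * theta_hat * (1 - theta_hat) / n)   # half-width of the interval
wald_ci <- c(theta_hat - half, theta_hat + half)
wald_ci
```

With such a small $n$ the upper limit exceeds 1, so in practice the interval would be truncated to the parameter space $[0, 1]$.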


Again, here the score function is

$$l'_x(\theta) = \frac{\partial}{\partial\theta}\left[x\log\theta + (n - x)\log(1 - \theta)\right] = \frac{x - n\theta}{\theta(1 - \theta)}.$$

Hence we have $Q_S = \frac{(x - n\theta_0)^2}{n\theta_0(1 - \theta_0)}$.

Similar to the Wald approach, the $100(1 - \alpha)\%$ large sample score confidence interval of $\theta$ will be

$$S_S(x) = \left\{\theta : \theta^2\left(n^2 + n\chi^2_{1,\alpha}\right) - \theta\left(2nx + n\chi^2_{1,\alpha}\right) + x^2 \le 0\right\}.$$


If $l$ and $u$ ($l < u$) are the two roots of the quadratic equation $\theta^2\left(n^2 + n\chi^2_{1,\alpha}\right) - \theta\left(2nx + n\chi^2_{1,\alpha}\right) + x^2 = 0$ in $\theta$, then we have

$$S_S(x) = \{\theta : (\theta - l)(\theta - u) \le 0\} = \{\theta : l \le \theta \le u\} = [l, u].$$

N.B. Check that here

$$l = \frac{2nx + nc - \sqrt{n^2c^2 + 4n^2xc - 4ncx^2}}{2(n^2 + nc)}$$

and

$$u = \frac{2nx + nc + \sqrt{n^2c^2 + 4n^2xc - 4ncx^2}}{2(n^2 + nc)},$$

where $c = \chi^2_{1,\alpha}$.
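The closed-form roots $l$ and $u$ of the score interval can be checked numerically. The following R sketch uses the realization of the example ($n = 10$, $x = 8$, $\alpha = 0.05$); the variable names are illustrative.

```r
# Score confidence interval for a Bernoulli parameter from the quadratic roots.
n <- 10; x <- 8; alpha <- 0.05
cc <- qchisq(1 - alpha, df = 1)                           # c = chi-square(1) upper-alpha point
disc <- n^2 * cc^2 + 4 * n^2 * x * cc - 4 * n * cc * x^2  # discriminant of the quadratic
l <- (2 * n * x + n * cc - sqrt(disc)) / (2 * (n^2 + n * cc))
u <- (2 * n * x + n * cc + sqrt(disc)) / (2 * (n^2 + n * cc))

# Both roots should satisfy Q_S(theta) = c, with Q_S = (x - n*theta)^2 / (n*theta*(1 - theta)).
Qs <- function(th) (x - n * th)^2 / (n * th * (1 - th))
c(l, u)
```

Unlike the Wald interval for the same data, both limits stay inside $(0, 1)$.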


MLE under Survival Data: Type I and II Censoring
Module 15 


Saurav De 


Department of Statistics 
Presidency University 




Censoring is a key issue in survival analysis, and it distinguishes survival analysis from other statistical problems.

• Censoring occurs when an observation is incomplete due to some random cause.

• For standard methods of analysis, the cause of censoring is usually assumed to be independent of the event of interest.


Censoring differs from truncation in that the incompleteness of the observations under truncation is due to a systematic selection process inherent in the study design.

Based on the direction from which the incompleteness in the observations arises, censoring is of three types:

• Right censoring • Left censoring • Interval censoring


Right censoring: the most common form of censoring.

Here an item is followed until some time at which the event (i.e. failure or death) has yet to occur, but the item takes no further part in the study after that time.

e.g. A lung cancer patient is recruited into a clinical trial to test the effect of a drug on survival from the disease, but dies in a car accident $T$ years after the onset of the disease.

$\Rightarrow$ the patient's survival time with lung cancer is at least $T$ years, but the exact value cannot be known.

$\Rightarrow$ the observation is right censored.


Left censoring: This occurs when the event of interest has already taken place at the time of observation, but the exact time of occurrence of the event is not known.

e.g.
• Onset of an asymptomatic illness, like brain cancer
• Infection with a sexually transmitted disease like HIV / AIDS


Interval censoring:

• Here the exact time of occurrence of the event is not known precisely, but an interval bounding this time is known.

• In case the interval is very short (e.g. 1 day or 1 hour), the common practice is to ignore the interval censoring and to use one end-point of the interval consistently.

e.g.
• Failure of a machine during the Chinese New Year celebration

• Infection with a sexually transmitted disease like HIV / AIDS between two annual check-ups


Depending on how the censoring mechanism works, there are three broad types of censoring:

• Type I censoring • Type II censoring • Random censoring

We will discuss in brief
• the above three types in right-censored form, and

• the MLEs of the corresponding parameters under survival probability models.


Type II censoring:
• Suppose $n$ randomly sampled units are put on a life-testing experiment.

• But for some reason the experiment terminates after the smallest $r$ readings.

• Let these be denoted by the order statistics $T_{(1)}, \ldots, T_{(r)}$.

• Here the integer $r$ is fixed in advance, i.e. nonrandom.


Since the remaining $n - r$ sample values are at least as large as $T_{(r)}$, the sampling scheme is a censored one.

Such censoring is known as Type II censoring.

Type II censoring is frequently used in life-testing experiments.
• Here, say, a total of $n$ items are placed on test.

• Now instead of continuing until all $n$ items have failed, suppose the experimenter waits just for the first $r$ failures.

• Such a test saves both time and money.


Let $T_i$ denote the lifetime (failure time) of the $i$th item.

Suppose the $T_i$'s are iid with a continuous distribution having pdf $f_\theta(t)$ and cdf $F_\theta(t)$, where $\theta$ is the parameter of the distribution.

Then, given $t_{(1)}, \ldots, t_{(r)}$, the realization of $T_{(1)}, \ldots, T_{(r)}$, the likelihood of $\theta$ under Type II censoring is

$$L(\theta) = \frac{n!}{(n - r)!}\prod_{i=1}^{r} f_\theta(t_{(i)})\,\left[\bar{F}_\theta(t_{(r)})\right]^{n-r}.$$
Verification:

From the theory of order statistics, the joint pdf of all the order statistics $T_{(1)}, \ldots, T_{(n)}$ is

$$h_\theta(t_{(1)}, \ldots, t_{(n)}) = n!\prod_{i=1}^{n} f_\theta(t_{(i)}).$$

$\Rightarrow$ the marginal joint pdf of $T_{(1)}, \ldots, T_{(r)}$ at $t_{(1)}, \ldots, t_{(r)}$ will be

$$g_\theta(t_{(1)}, \ldots, t_{(r)}) = \int_{t_{(r)}}^{\infty}\cdots\int_{t_{(n-1)}}^{\infty} n!\prod_{i=1}^{n} f_\theta(t_{(i)})\; dt_{(n)}\cdots dt_{(r+1)}$$

$$= n!\prod_{i=1}^{r} f_\theta(t_{(i)})\int\cdots\int \left[1 - F_\theta(t_{(n-1)})\right] f_\theta(t_{(n-1)})\, dt_{(n-1)}\; f_\theta(t_{(n-2)})\cdots$$

$$= n!\prod_{i=1}^{r} f_\theta(t_{(i)})\int\cdots\int \frac{\left[1 - F_\theta(t_{(n-2)})\right]^2}{2!}\, f_\theta(t_{(n-2)})\, dt_{(n-2)}\; f_\theta(t_{(n-3)})\cdots$$

$$= n!\prod_{i=1}^{r} f_\theta(t_{(i)})\int\cdots\int \frac{\left[1 - F_\theta(t_{(n-3)})\right]^3}{3!}\, f_\theta(t_{(n-3)})\, dt_{(n-3)}\; f_\theta(t_{(n-4)})\cdots$$

$\Rightarrow$ finally we get

$$g_\theta(t_{(1)}, \ldots, t_{(r)}) = \frac{n!}{(n - r)!}\prod_{i=1}^{r} f_\theta(t_{(i)})\,\left[\bar{F}_\theta(t_{(r)})\right]^{n-r}.$$


Given the realizations, this joint pdf, viewed as a function of $\theta$, is the likelihood of $\theta$. Hence the form of the likelihood is verified.

Here $\bar{F}_\theta(t)$ is the survival function at the time point $t$.
Illustration: Let the lifetime of an item be exponential with mean $\theta$.
$\Rightarrow$ the pdf is $f_\theta(t) = \frac{1}{\theta}\exp(-t/\theta)$

and the survival function at the time point $t$ is

$$\bar{F}_\theta(t) = \exp(-t/\theta).$$


$\Rightarrow$ under Type II censoring, the likelihood function of $\theta$ will be

$$L(\theta) = \frac{n!}{(n - r)!}\,\frac{1}{\theta^{r}}\exp\left\{-\frac{\sum_{i=1}^{r} t_{(i)} + (n - r)\,t_{(r)}}{\theta}\right\}$$

$$\Rightarrow\ l(\theta) = \text{Const} - r\log\theta - \frac{\sum_{i=1}^{r} t_{(i)} + (n - r)\,t_{(r)}}{\theta}$$

$$\Rightarrow\ l'(\theta) = -\frac{r}{\theta} + \frac{\sum_{i=1}^{r} t_{(i)} + (n - r)\,t_{(r)}}{\theta^2}$$

$\Rightarrow$ the unique solution of the likelihood equation $l'(\theta) = 0$ will be

$$\hat\theta = \frac{\sum_{i=1}^{r} t_{(i)} + (n - r)\,t_{(r)}}{r}.$$
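The closed-form MLE for exponential lifetimes under Type II censoring can be sketched as a small R function. The function name `type2_exp_mle` and the failure times are hypothetical, chosen only to make the arithmetic easy to follow.

```r
# MLE of the exponential mean under Type II censoring:
# theta_hat = (sum of the r observed failure times + (n - r) * largest observed time) / r
type2_exp_mle <- function(t_obs, n) {
  t_obs <- sort(t_obs)                 # first r order statistics t_(1) <= ... <= t_(r)
  r <- length(t_obs)
  stopifnot(r >= 1, r <= n)
  total_time <- sum(t_obs) + (n - r) * t_obs[r]   # total time on test
  total_time / r
}

# Hypothetical data: n = 5 items on test, experiment stopped at the 3rd failure.
type2_exp_mle(c(1, 2, 3), n = 5)   # (1 + 2 + 3 + 2 * 3) / 3 = 4
```

The numerator is the total time on test: the $n - r$ surviving items each contribute the stopping time $t_{(r)}$.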


From the second-order condition (SOC) we can ensure that $\hat\theta$ maximises $L(\theta)$, i.e. $\hat\theta$ is the MLE of $\theta$ under Type II censoring.

Note. $\hat\theta = \frac{\sum_{i=1}^{r} T_{(i)} + (n - r)\,T_{(r)}}{r}$ is the MVUE of $\theta$.

Verification. The joint pdf of $T_{(1)}, \ldots, T_{(r)}$ is

$$g_\theta(t_{(1)}, \ldots, t_{(r)}) = \frac{n!}{(n - r)!}\,\frac{1}{\theta^{r}}\exp\left\{-\frac{\sum_{i=1}^{r} t_{(i)} + (n - r)\,t_{(r)}}{\theta}\right\}.$$

Define $Z_1 = nT_{(1)}$ and $Z_i = (n - i + 1)\left(T_{(i)} - T_{(i-1)}\right)$, $i = 2, \ldots, r$.


Check that the Jacobian of the transformation from $(T_{(1)}, \ldots, T_{(r)})$ to $(Z_1, \ldots, Z_r)$ is $\frac{(n - r)!}{n!}$, and

$$\sum_{i=1}^{r} Z_i = \sum_{i=1}^{r} T_{(i)} + (n - r)\,T_{(r)} = r\hat\theta.$$

$\Rightarrow$ the joint pdf of $Z_1, \ldots, Z_r$ is

$$\frac{1}{\theta^{r}}\exp\left\{-\frac{1}{\theta}\sum_{i=1}^{r} z_i\right\}, \quad z_i > 0,$$

which implies that $Z_1, \ldots, Z_r$ are iid exponential (mean $= \theta$) random variables.
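The identity $\sum Z_i = r\hat\theta$ for the normalised spacings can be verified numerically. The sketch below uses hypothetical failure times; writing $T_{(0)} = 0$ lets all $r$ spacings, including $Z_1 = nT_{(1)}$, be computed with a single `diff()` call.

```r
# Normalised spacings Z_i = (n - i + 1) * (T_(i) - T_(i-1)), with T_(0) = 0.
t_obs <- c(1, 2, 3)     # hypothetical first r = 3 ordered failure times
n <- 5
r <- length(t_obs)
z <- (n - seq_len(r) + 1) * diff(c(0, t_obs))        # Z_1, ..., Z_r
theta_hat <- (sum(t_obs) + (n - r) * t_obs[r]) / r   # Type II censored MLE
c(sum_z = sum(z), r_theta_hat = r * theta_hat)       # the two should agree
```

Since the $Z_i$ are iid exponential with mean $\theta$, their average $\hat\theta$ is unbiased for $\theta$.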


$\Rightarrow\ E_\theta(\hat\theta) = E_\theta\left(\frac{1}{r}\sum_{i=1}^{r} Z_i\right) = \theta.$

Also $g_\theta(t_{(1)}, \ldots, t_{(r)})$ belongs to the one-parameter exponential family (OPEF) $\Rightarrow$ the statistic $\hat\theta = \frac{\sum_{i=1}^{r} T_{(i)} + (n - r)\,T_{(r)}}{r}$ is complete sufficient. Hence by the Lehmann-Scheffé theorem the Note follows.

Type I Censoring
Sometimes experiments are run over a fixed period of time, such that the exact lifetime of an item will be known only if it is less than some pre-determined value.

In such a situation the data are said to be Type I censored (from the right).

More precisely, a Type I censored sample is one that arises when

• $n$ items, numbered say $1, 2, \ldots, n$, are subject to limited periods of observation, and

• $L_1, \ldots, L_n$ are those periods, such that the $i$th item's lifetime $T_i$ is observable only if $T_i \le L_i$.

$L_i$ is called the fixed censoring time for the $i$th item.

If all the $L_i$ are equal, the data are said to be singly Type I censored.


Assume that the $T_i$'s are iid with common pdf $f_\theta(t)$ and survival function $\bar{F}_\theta(t)$.

From the $i$th item we record the exact lifetime $T_i$ as the realization provided $T_i \le L_i$. Otherwise $L_i$ is recorded as the realization.

Let $Y_i$ denote the potential response (the response which is surely obtained) from the $i$th item.


Then

$$Y_i = \begin{cases} T_i & \text{if } T_i \le L_i \ \text{(uncensored case)} \\ L_i & \text{if } T_i > L_i \ \text{(censored case)} \end{cases}$$

for all $i$, i.e. $Y_i = \min\{T_i, L_i\}$.

Also define the indicator variables

$$\delta_i = \begin{cases} 1 & \text{if } T_i \le L_i \ \text{(uncensored case)} \\ 0 & \text{if } T_i > L_i \ \text{(censored case)} \end{cases}$$

The $\delta_i$'s are called censoring indicators.
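The construction of the pairs $(Y_i, \delta_i)$ from lifetimes and fixed censoring times can be sketched in R as follows. The function name `make_type1` and the numbers are hypothetical; in a real experiment the lifetimes beyond $L_i$ would of course not be available.

```r
# Build Type I censored data (Y_i, delta_i) from lifetimes T_i and censoring times L_i.
make_type1 <- function(lifetime, L) {
  stopifnot(length(lifetime) == length(L))
  data.frame(
    y = pmin(lifetime, L),               # Y_i = min(T_i, L_i)
    delta = as.integer(lifetime <= L)    # 1 = uncensored, 0 = censored
  )
}

lifetime <- c(0.4, 2.5, 1.1, 3.0)   # hypothetical lifetimes
L <- rep(2, 4)                      # common censoring time (singly Type I censored)
d <- make_type1(lifetime, L)
d
```

Items 2 and 4 outlive the censoring time, so their recorded values equal $L_i = 2$ with $\delta_i = 0$.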


So Type I censored data can be represented by the pairs of random variables $(Y_i, \delta_i)$ for all $i$.

$\Rightarrow$ the joint likelihood of $\theta$ for a given data set $\{(t_i, \delta_i),\ i = 1, \ldots, n\}$ on the $(Y_i, \delta_i)$'s will be

$$L(\theta) = \prod_{i=1}^{n}\left[f_\theta(t_i)\right]^{\delta_i}\left[\bar{F}_\theta(L_i)\right]^{1-\delta_i}.$$

How is this obtained? It is true that $P_\theta[Y_i = y_i \mid \delta_i = 0] = 1$ if $y_i = L_i$.

$$\Rightarrow\ P_\theta[Y_i = y_i, \delta_i = 0] = P_\theta[\delta_i = 0] = P_\theta[T_i > L_i] = \bar{F}_\theta(L_i) \quad \text{if } y_i = L_i,$$

i.e. the likelihood for the $i$th item is

$$L_i(\theta) = \bar{F}_\theta(L_i) \quad \text{if } \delta_i = 0\ (y_i = L_i) \quad \ldots (*)$$


Again,

$$P_\theta[Y_i \le y_i, \delta_i = 1] = P_\theta[T_i \le y_i] = F_\theta(y_i) \quad (\text{as } \delta_i = 1 \iff T_i \le L_i \iff Y_i = T_i)$$

$$\Rightarrow\ L_i(\theta) = f_\theta(y_i) \quad \text{if } \delta_i = 1 \quad \ldots (**)$$

(*) and (**) $\Rightarrow\ L_i(\theta) = \left[f_\theta(t_i)\right]^{\delta_i}\left[\bar{F}_\theta(L_i)\right]^{1-\delta_i}$.

As the pairs $(Y_i, \delta_i)$ are independent, the joint likelihood of $\theta$ will be

$$L(\theta) = \prod_{i=1}^{n}\left[f_\theta(t_i)\right]^{\delta_i}\left[\bar{F}_\theta(L_i)\right]^{1-\delta_i}.$$


Suppose the readings (in some suitable unit) of life from 10 items, set on an experiment, are as follows:

1.4*, 0.17, 1.4*, 1.4*, 0.28, 0.94, 1.4*, 0.7, 1.07, 1.20

where a reading marked * is censored from the right. If the life distribution is Weibull with density

$$f(t) = \alpha\beta\, t^{\beta-1} e^{-\alpha t^{\beta}}, \quad t > 0;\ \alpha, \beta > 0,$$

and also $\alpha = 1$, find the ML estimate of $\beta$ from the life data readings.

Computation. From the nature of the censoring, the data are Type I censored from the right and have the common censoring time point 1.4.


Also, here $\beta$ is the only unknown parameter to be estimated.
Note that for the given Weibull distribution with $\alpha = 1$, $\bar{F}(t) = e^{-t^{\beta}}$. Hence the likelihood function of $\beta$ (with $\alpha = 1$) will be

$$L(\beta) = \prod_{i=1}^{10}\{f(t_i)\}^{\delta_i}\{\bar{F}(L)\}^{1-\delta_i},$$

where $\delta_i$ is the censoring indicator and $L$ is the common censoring time (= 1.4 here). Therefore

$$L(\beta) = \beta^{r}\prod_{i=1}^{10} t_i^{(\beta-1)\delta_i}\; e^{-\sum_{i=1}^{10}\delta_i t_i^{\beta}}\; e^{-(10-r)L^{\beta}},$$

where $r = \sum_{i=1}^{10}\delta_i$ = number of uncensored cases, and the $t_i$'s denote the exact readings.


So the loglikelihood of $\beta$ will be

$$l(\beta) = r\ln\beta + (\beta - 1)\sum_{i=1}^{10}\delta_i\ln t_i - \sum_{i=1}^{10}\delta_i t_i^{\beta} - (10 - r)L^{\beta}$$

$$\Rightarrow\ \frac{\partial}{\partial\beta}\, l(\beta) = \frac{r}{\beta} + \sum_{i=1}^{10}\delta_i\ln t_i - \sum_{i=1}^{10}\delta_i t_i^{\beta}\ln t_i - (10 - r)L^{\beta}\ln L.$$

Hence the likelihood equation of $\beta$ reduces to the form

$$\beta = r\left[\sum_{i=1}^{10}\delta_i t_i^{\beta}\ln t_i + (10 - r)L^{\beta}\ln L - \sum_{i=1}^{10}\delta_i\ln t_i\right]^{-1} \quad \ldots (*)$$


(*) does not have an explicit solution, so we have to solve it numerically (using the Newton-Raphson method) for the ML estimate of $\beta$. To find the initial value of $\beta$, we use the quantile method.


R program for the numerical solution of equation (*):
R Code and Output:

> t=c(1.4,0.17,1.4,1.4,0.28,0.94,1.4,0.7,1.07,1.20)
> del=c(0,1,0,0,1,1,0,1,1,1)
> max.iter=100
> r=sum(del)
> # initial value of "beta"
> qu=quantile(t)
> init=log(log(4))/log(as.numeric(qu[4]))
> init
[1] 0.9707614
> beta=NULL
> beta[1]=init
> # 1st derivative function
> fun1=function(b)
+ {
+   sum1=0
+   sum2=0
+   for(i in 1:length(t))
+   {
+     sum1=sum1+(del[i]*log(t[i]))
+   }
+   for(j in 1:length(t))
+   {
+     sum2=sum2+(del[j]*((t[j])^b)*log(t[j]))
+   }
+   l1=(r/b)+(sum1)-sum2-((10-r)*((1.4)^b)*log(1.4))
+   return(l1)
+ }
> # 2nd derivative function
> fun2=function(b)
+ {
+   sum1=0
+   for(i in 1:length(t))
+   {
+     sum1=sum1+(del[i]*((t[i])^b)*log(t[i])*log(t[i]))
+   }
+   l2=-(r/(b*b))-sum1-((10-r)*((1.4)^b)*log(1.4)*log(1.4))
+   return(l2)
+ }
> # Newton-Raphson iterations; stop when successive iterates agree
> for(k in 2:max.iter)
+ {
+   beta[k]=beta[k-1]-(fun1(beta[k-1])/fun2(beta[k-1]))
+   if(abs(beta[k]-beta[k-1])<0.0000001)
+     break
+ }
> # MLE of 'beta' (converged value)
> beta[k]
[1] 1.244887
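As an independent check on the Newton-Raphson result, the likelihood equation can also be solved with base R's `uniroot()`. This is a sketch of an alternative root-finding route, not the method used in the lecture; the names `times`, `score` and `beta_hat` are illustrative. Because each censored item is recorded at $L = 1.4$, the sum of $t_i^{\beta}\ln t_i$ over all ten readings covers both the uncensored and the censored terms of the score.

```r
# Solve the Weibull (alpha = 1) likelihood equation under Type I censoring with uniroot().
times <- c(1.4, 0.17, 1.4, 1.4, 0.28, 0.94, 1.4, 0.7, 1.07, 1.20)
del   <- c(0, 1, 0, 0, 1, 1, 0, 1, 1, 1)   # 1 = exact reading, 0 = censored at 1.4
r <- sum(del)

# Score: r/b + sum(delta_i * log t_i) - sum over all items of y_i^b * log y_i,
# where y_i is the recorded value (t_i if uncensored, L = 1.4 if censored).
score <- function(b) r / b + sum(del * log(times)) - sum(times^b * log(times))

beta_hat <- uniroot(score, interval = c(0.1, 5), tol = 1e-10)$root
beta_hat
```

The root should agree with the Newton-Raphson value reported above to the printed precision.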




MLE under Survival Data: Type I and Random
Censoring and K-M Estimator
Module 16 


Saurav De 


Department of Statistics 
Presidency University 




An illustration under Type I censoring:

Consider an item with exponential life with mean $\theta$.
Under Type I censoring the likelihood of $\theta$ will be

$$L(\theta) = \prod_{i=1}^{n}\left[f_\theta(t_i)\right]^{\delta_i}\left[\bar{F}_\theta(L_i)\right]^{1-\delta_i},$$

where $f_\theta(t) = \frac{1}{\theta}\exp\{-t/\theta\}$ and $\bar{F}_\theta(t) = \exp\{-t/\theta\}$.


If we define the set $A = \{i \mid \delta_i = 1\}$, we can write

$$L(\theta) = \prod_{i\in A}\frac{1}{\theta}\exp\{-t_i/\theta\}\cdot\prod_{i\in A^c}\exp\{-L_i/\theta\} = \frac{1}{\theta^{r}}\exp\left\{-\frac{1}{\theta}\left[\sum_{i\in A} t_i + \sum_{i\in A^c} L_i\right]\right\},$$

where $r = \sum_{i\in A}\delta_i = \sum_{i=1}^{n}\delta_i$ = number of uncensored cases observed.


Now we can write $\sum_{i\in A} t_i + \sum_{i\in A^c} L_i = \sum_{i=1}^{n} y_i$

$$\Rightarrow\ L(\theta) = \frac{1}{\theta^{r}}\exp\left\{-\frac{1}{\theta}\sum_{i=1}^{n} y_i\right\}.$$

Hence we get the MLE of $\theta$ as $\hat\theta = \frac{\sum_{i=1}^{n} y_i}{r}$.

Note: Here $r$ is not fixed but a random variable.


$\Rightarrow$ It is difficult to find the exact sampling distribution of $\hat\theta$.
Therefore consider the asymptotic distribution of $\hat\theta$.
Asymptotic distribution of the MLE $\Rightarrow$

$$\hat\theta = \hat\theta_n \overset{a}{\sim} N\left(\theta, I^{-1}(\theta)\right),$$

where $I(\theta)$ is the Fisher information,
i.e. $I(\theta) = E\left[-\frac{\partial^2}{\partial\theta^2}\log L(\theta)\right]$.

Now $L(\theta) = \frac{1}{\theta^{r}}\exp\left\{-\frac{1}{\theta}\sum_{i=1}^{n} y_i\right\}$

$$\Rightarrow\ \frac{\partial^2}{\partial\theta^2}\log L(\theta) = \frac{r}{\theta^2} - \frac{2\sum_{i=1}^{n} y_i}{\theta^3}.$$

And

$$I(\theta) = -\frac{E(r)}{\theta^2} + \frac{2\sum_{i=1}^{n} E(Y_i)}{\theta^3} \quad \ldots (*)$$


Now

$$E(Y_i) = E(Y_i \mid \delta_i = 1)\,P_\theta[\delta_i = 1] + E(Y_i \mid \delta_i = 0)\,P_\theta[\delta_i = 0] \quad \ldots (**)$$

As $P_\theta[Y_i = L_i \mid \delta_i = 0] = 1$, we have $E(Y_i \mid \delta_i = 0) = L_i$.
Also

$$E(Y_i \mid \delta_i = 1) = E(T_i \mid T_i \le L_i) = \frac{\int_0^{L_i}\frac{t}{\theta}\exp(-t/\theta)\,dt}{1 - \exp(-L_i/\theta)}$$

$$= \theta\left(1 - \exp(-L_i/\theta)\right)^{-1}\left[1 - \left(\frac{L_i}{\theta} + 1\right)\exp\{-L_i/\theta\}\right].$$


On simplification, (**) $\Rightarrow$

$$E(Y_i) = \theta\left[1 - \left(\frac{L_i}{\theta} + 1\right)\exp\{-L_i/\theta\}\right] + L_i\exp\{-L_i/\theta\} = \theta\left(1 - \exp\{-L_i/\theta\}\right).$$

Also

$$E(r) = \sum_{i=1}^{n} E(\delta_i) = \sum_{i=1}^{n} P_\theta[\delta_i = 1] = \sum_{i=1}^{n}\left(1 - \exp\{-L_i/\theta\}\right) = Q, \text{ say}.$$


Hence finally (*) $\Rightarrow$

$$I(\theta) = -\frac{Q}{\theta^2} + \frac{2\theta\sum_{i=1}^{n}\left(1 - \exp\{-L_i/\theta\}\right)}{\theta^3} = \frac{Q}{\theta^2}$$

$$\Rightarrow\ \hat\theta \sim AN\left(\theta, \frac{\theta^2}{Q}\right).$$

Note: The MLE of $Q$ will be $\hat{Q} = \sum_{i=1}^{n}\left(1 - \exp\{-L_i/\hat\theta\}\right)$
(by the invariance property of the MLE).
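The MLE $\hat\theta = \sum y_i / r$, the plug-in $\hat{Q}$ and the resulting large-sample standard error $\hat\theta/\sqrt{\hat{Q}}$ can be sketched in R as follows. The function name `type1_exp_fit` and the data are hypothetical, not taken from the lecture.

```r
# Exponential lifetimes under Type I censoring: MLE and plug-in asymptotic standard error.
type1_exp_fit <- function(y, delta, L) {
  r <- sum(delta)                          # number of uncensored observations
  stopifnot(r > 0)                         # the MLE does not exist if everything is censored
  theta_hat <- sum(y) / r                  # MLE of the mean
  Q_hat <- sum(1 - exp(-L / theta_hat))    # MLE of Q (invariance property)
  list(theta_hat = theta_hat, Q_hat = Q_hat,
       se = theta_hat / sqrt(Q_hat))       # sqrt of the estimated variance theta^2 / Q
}

# Hypothetical data with common censoring time L = 2.
y     <- c(0.4, 2, 1.1, 2, 0.9)
delta <- c(1, 0, 1, 0, 1)
L     <- rep(2, 5)
fit <- type1_exp_fit(y, delta, L)
fit$theta_hat    # 6.4 / 3
```

With only five items the normal approximation based on this standard error is rough, which is the motivation for the Sprott transformation discussed next.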


Note: For large $n$, the normal approximation for $\hat\theta$ is good. However, for small $n$ it is rather poor. There are alternative approximate methods that can be recommended in the context of asymptotic normality even for small $n$. One such approach, discussed below, is due to D. A. Sprott (1973, Biometrika).

Sprott showed that the transformed estimator $\hat\phi = \hat\theta^{-1/3}$ converges in distribution to normality more closely than $\hat\theta$ itself, even for small $n$.

Obviously here $\phi = \theta^{-1/3}$. Also, from Taylor's expansion we get

$$E(\hat\phi) \approx \phi = \theta^{-1/3}$$


and

$$V(\hat\phi) \approx \left(\frac{d\phi}{d\theta}\right)^2 V(\hat\theta) = \left(-\frac{1}{3}\theta^{-4/3}\right)^2\frac{\theta^2}{Q} = \frac{\theta^{-2/3}}{9Q} = \frac{\phi^2}{9Q}.$$

Thus for testing any hypothesis or constructing any confidence interval for some parametric function of $\theta$, we can start with

$$\frac{\hat\phi - \phi}{\sqrt{\phi^2/(9Q)}} \overset{a}{\sim} N(0, 1) \quad \text{or} \quad \frac{\hat\phi - \phi}{\sqrt{\hat\phi^2/(9\hat{Q})}} \overset{a}{\sim} N(0, 1).$$
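A confidence interval for $\theta$ based on the cube-root transformation can be sketched in R by building the interval for $\phi$ from the second pivot (with $\hat\phi$ and $\hat{Q}$ plugged in) and transforming back through $\theta = \phi^{-3}$. The function name `sprott_ci` and the input values are hypothetical.

```r
# Confidence interval for theta via phi = theta^(-1/3), using
# (phi_hat - phi) / sqrt(phi_hat^2 / (9 * Q_hat)) ~ N(0, 1) approximately.
sprott_ci <- function(theta_hat, Q_hat, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  phi_hat <- theta_hat^(-1/3)
  se_phi <- phi_hat / (3 * sqrt(Q_hat))        # sqrt(phi_hat^2 / (9 * Q_hat))
  phi_lim <- phi_hat + c(-1, 1) * z * se_phi   # interval for phi
  stopifnot(phi_lim[1] > 0)                    # back-transformation needs phi > 0
  rev(phi_lim^(-3))                            # theta = phi^(-3) is decreasing, so limits swap
}

# Hypothetical inputs: theta_hat = 6.4/3 and Q_hat = 3.042.
ci <- sprott_ci(theta_hat = 6.4 / 3, Q_hat = 3.042)
ci
```

The resulting interval is asymmetric about $\hat\theta$, with a longer upper arm, unlike the symmetric interval $\hat\theta \pm z\,\hat\theta/\sqrt{\hat{Q}}$.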


Random Censoring : the most frequently used censoring in statistics.


Random censoring is very similar to Type I censoring. The only difference
is that here the censoring times are also random variables.


A simple random censoring process is one in which the ith item is assumed to
have lifetime $T_i$ and censoring time $C_i$, with $T_i$ and $C_i$ independently
distributed continuous random variables.




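• Let $T_i$ be iid with common pdf $f_\theta(\cdot)$ and common cdf $F_\theta(\cdot)$; write $\bar F_\theta = 1 - F_\theta$ for the survival function.
• Let $C_i$ be iid with common pdf $g(\cdot)$ and common cdf $G(\cdot)$; write $\bar G = 1 - G$.
Define $(Y_i, \delta_i)$ in the same way as in Type I censoring, i.e. $Y_i = \min(T_i, C_i)$, with $\delta_i = 1$ if $T_i \le C_i$ and $\delta_i = 0$ otherwise.


Then the likelihood function of $\theta$ for the given data set $\{(t_i, \delta_i),\ i = 1, \ldots, n\}$
on the paired random variables $(Y_i, \delta_i)$ will be


$$L(\theta) = \prod_{i=1}^{n} \left[f_\theta(t_i)\,\bar G(t_i)\right]^{\delta_i} \left[g(t_i)\,\bar F_\theta(t_i)\right]^{1-\delta_i}$$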




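Verification of the form of likelihood:
Under random censoring we know

$$P_\theta\left[y_i < Y_i \le y_i + \Delta y_i,\ \delta_i = 0\right] = P_\theta\left[y_i < C_i \le y_i + \Delta y_i,\ T_i > y_i\right]$$

(as $\delta_i = 0 \Leftrightarrow T_i > C_i \Leftrightarrow Y_i = C_i$)

$$= P\left[y_i < C_i \le y_i + \Delta y_i\right]\, P_\theta\left[T_i > y_i\right]$$

(as $T_i$ and $C_i$ are independently distributed)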




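$\Rightarrow$ with $\delta_i = 0$ the likelihood corresponding to the ith unit will be

$$L_i(\theta) = \lim_{\Delta y_i \to 0} \frac{P_\theta\left[y_i < Y_i \le y_i + \Delta y_i,\ \delta_i = 0\right]}{\Delta y_i} = g(y_i)\,\bar F_\theta(y_i)$$

On the other hand

$$P_\theta\left[y_i < Y_i \le y_i + \Delta y_i,\ \delta_i = 1\right] = P_\theta\left[y_i < T_i \le y_i + \Delta y_i,\ C_i > y_i\right]$$

(as now $\delta_i = 1 \Leftrightarrow T_i \le C_i \Leftrightarrow Y_i = T_i$)

$$= P_\theta\left[y_i < T_i \le y_i + \Delta y_i\right]\, P\left[C_i > y_i\right]$$

$\Rightarrow$ now $L_i(\theta) = f_\theta(y_i)\,\bar G(y_i)$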




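Thus we have

$$L_i(\theta) = \begin{cases} g(t_i)\,\bar F_\theta(t_i) & \text{with } \delta_i = 0 \\ f_\theta(t_i)\,\bar G(t_i) & \text{with } \delta_i = 1 \end{cases}$$

Combining these two forms we can write

$$L_i(\theta) = \left[f_\theta(t_i)\,\bar G(t_i)\right]^{\delta_i} \left[g(t_i)\,\bar F_\theta(t_i)\right]^{1-\delta_i} \text{ for all } i.$$

$\Rightarrow$ the joint likelihood function of $\theta$ based on the given data set on
independent $(Y_i, \delta_i)$ pairs will be

$$L(\theta) = \prod_{i=1}^{n} L_i(\theta) = \prod_{i=1}^{n} \left[f_\theta(t_i)\,\bar G(t_i)\right]^{\delta_i} \left[g(t_i)\,\bar F_\theta(t_i)\right]^{1-\delta_i}$$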




Informative and Noninformative Random Censoring :


Let $\theta$ : parameter of the probability distribution of the lifetime variable $T_i$.

If the probability distribution of the censoring variable $C_i$ also involves $\theta$ as
its parameter, the random censoring is informative.

(Reason : the censoring variables then also carry information about $\theta$.)


Otherwise, it is noninformative.
e.g. $T_i \sim$ Exponential($\theta$), $C_i \sim$ Exponential($\gamma\theta$)

$\Rightarrow$ informative censoring.
$T_i \sim$ Exponential($\theta$), $C_i \sim$ Exponential($\phi$); $\phi$ independent of $\theta$


$\Rightarrow$ noninformative censoring.
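In the noninformative case the factors involving $g$ and $\bar G$ in $L(\theta)$ are free of $\theta$, so only $f_\theta^{\delta_i}\,\bar F_\theta^{1-\delta_i}$ matters for estimation. The R sketch below illustrates this on simulated data, assuming exponential lifetimes with mean $\theta$ and an unrelated exponential censoring distribution; the sample size, seed and parameter values are arbitrary choices.

```r
set.seed(1)                                   # arbitrary seed
n     <- 200
theta <- 5                                    # assumed true mean lifetime
T_i   <- rexp(n, rate = 1 / theta)            # lifetimes
C_i   <- rexp(n, rate = 1 / 8)                # censoring times, distribution free of theta
y     <- pmin(T_i, C_i)                       # observed time Y_i
delta <- as.integer(T_i <= C_i)               # 1 = failure observed, 0 = censored

# log L(theta) up to terms free of theta:
# delta * log f_theta(y) + (1 - delta) * log Fbar_theta(y)
loglik <- function(th) sum(delta * (-log(th) - y / th) + (1 - delta) * (-y / th))

fit         <- optimize(loglik, interval = c(0.01, 100), maximum = TRUE, tol = 1e-8)
closed_form <- sum(y) / sum(delta)            # total time on test / number of failures

print(c(numerical = fit$maximum, closed_form = closed_form))
```

The numerical maximiser agrees with the closed form $\sum y_i / \sum \delta_i$, and the censoring distribution never enters the computation.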




Kaplan-Meier (K-M) Estimator/Product Limit (PL) Estimator of
Survival Function


Let $t_1, \ldots, t_n$ be uncensored sample observations on failure times.


Then a nonparametric estimate of the survival function S(t) at the time
point t is given by


$$R_n(t) = \frac{\#\,\text{observations} > t}{n} \qquad \ldots (1)$$


This is basically the complementary empirical distribution function at the
point t.


But usually we cannot expect uncensored failure data, due to many
practical limitations.




Consider a Type I censored sample.


In this case the # lifetimes or failure times > t may not be known exactly.
$\Rightarrow$ we need some modification in (1).


The modified estimator, which incorporates the Type I censoring in an
appropriate way, is called the PL estimate of the survival function.


The PL estimator is also known as the K-M estimator, after Kaplan and Meier, the
authors who first discussed its properties.




Let there be n items and k (< n) distinct failure times $t_1 < t_2 < \ldots < t_k$
observed.

Let $d_j$ = # failures at time point $t_j$.

In addition to failure times, there are censoring times $L_i$ for items whose
lifetimes are not observed.

Let $n_j$ = # items at risk of failing at $t_j$ (i.e. # items that are functioning
and uncensored just prior to $t_j$).

Then the K-M estimator is defined as


$$R_n(t) = \prod_{j:\, t_j \le t} \frac{n_j - d_j}{n_j}$$


where $n_{j+1} = n_j - d_j - c_j$; $c_j$ = # items censored in $[t_j, t_{j+1})$.
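The definition translates into a few lines of base R. The sketch below is one possible implementation, not the routine used by any particular package; `km_estimate` is a hypothetical helper name and the five observations are made up. The estimate is reported just after each distinct failure time, so the factor for $t_j$ itself is included.

```r
km_estimate <- function(time, delta) {
  fail_times <- sort(unique(time[delta == 1]))               # t_1 < ... < t_k
  n_j <- sapply(fail_times, function(s) sum(time >= s))      # at risk just prior to t_j
  d_j <- sapply(fail_times, function(s) sum(time == s & delta == 1))  # failures at t_j
  data.frame(t = fail_times, n = n_j, d = d_j,
             surv = cumprod(1 - d_j / n_j))                  # running product of (n_j - d_j) / n_j
}

time  <- c(3, 5, 5, 7, 9)      # hypothetical observed times
delta <- c(1, 1, 0, 1, 0)      # 1 = failure, 0 = right censored
km <- km_estimate(time, delta)
print(km)
```

An item censored at the same time as a failure is counted as still at risk at that time, which is the usual convention.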




Result : The K-M estimator is a nonparametric MLE of the survival
function S(t).


Justification : Let T : lifetime of a randomly chosen item
$\Rightarrow S(t) = P[T > t]$ at the time point t.


Let $t_1 < t_2 < \ldots < t_m \le t < t_{m+1}$.
Then we can look upon S(t) as


$$S(t) = P[T > t_1]\; P[T > t_2 \mid T > t_1] \cdots P[T > t_m \mid T > t_{m-1}]$$

Let $\lambda_j$ = probability that a randomly chosen item will fail at the time $t_j$
given that it survived at $t_{j-1}$.




$$\Rightarrow S(t) = (1 - \lambda_1)(1 - \lambda_2) \cdots (1 - \lambda_m) = \prod_{j:\, t_j \le t} (1 - \lambda_j)$$


As $n_j$ = # items under the risk of failing at $t_j$ given that they survived at
$t_{j-1}$ (like # independent Bernoulli trials) and


$d_j$ = # items actually failed at $t_j$ (like random # successes out of $n_j$ trials)


$\Rightarrow d_j \sim$ Binomial$(n_j, \lambda_j)$, for all j




We already know that for the binomial distribution the MLE of $\lambda_j$ is $\hat\lambda_j = \dfrac{d_j}{n_j}$


$\Rightarrow$ by the invariance property, the MLE of S(t) will be


$$R_n(t) = \prod_{j:\, t_j \le t} (1 - \hat\lambda_j) = \prod_{j:\, t_j \le t} \left(1 - \frac{d_j}{n_j}\right) = \text{K-M estimator}$$


Note : $d_j$ always follows a binomial distribution irrespective of the parent
population distribution $F_\theta$ of the lifetime $T_i$.

$\Rightarrow$ the K-M estimator is distribution free, i.e. the K-M
estimator is a nonparametric estimator of the survival function and also
the MLE of it in the Type I censored case.


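The estimated asymptotic variance of the K-M estimator $R_n(t)$ :


$$\log R_n(t) = \sum_{j:\, t_j \le t} \log(1 - \hat\lambda_j).$$


So, from the delta method, the asymptotic variance of $\log R_n(t)$ will be


$$V\left(\log R_n(t)\right) = \sum_{j:\, t_j \le t} \left[\frac{\partial}{\partial \hat\lambda_j} \log(1 - \hat\lambda_j)\right]^2_{\hat\lambda_j = \lambda_j} V(\hat\lambda_j)
= \sum_{j:\, t_j \le t} \frac{1}{(1 - \lambda_j)^2} \cdot \frac{\lambda_j (1 - \lambda_j)}{n_j}
= \sum_{j:\, t_j \le t} \frac{\lambda_j}{n_j (1 - \lambda_j)}$$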




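So the estimated asymptotic variance of $\log R_n(t)$ will be


$$\hat V\left(\log R_n(t)\right) = \sum_{j:\, t_j \le t} \frac{\hat\lambda_j}{n_j (1 - \hat\lambda_j)} = \sum_{j:\, t_j \le t} \frac{d_j}{n_j (n_j - d_j)}$$


Again, using the delta method we have


$$\hat V\left(R_n(t)\right) = \left(R_n(t)\right)^2\, \hat V\left(\log R_n(t)\right) = \left(R_n(t)\right)^2 \sum_{j:\, t_j \le t} \frac{d_j}{n_j (n_j - d_j)}$$


This is known as Greenwood's formula (1926) for the asymptotic variance of the
K-M survival estimator.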
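Greenwood's formula needs only the at-risk and failure counts. The R sketch below evaluates it for three failure times; the counts are taken from the first three rows of the `survfit` summary reported later in this module, so the result can be compared with the `std.err` column there.

```r
n_j <- c(20, 19, 14)      # items at risk at the first three failure times
d_j <- c(1, 4, 2)         # failures at those times

surv     <- cumprod(1 - d_j / n_j)                    # K-M estimate R_n(t_j)
var_log  <- cumsum(d_j / (n_j * (n_j - d_j)))         # estimated variance of log R_n(t_j)
se_green <- surv * sqrt(var_log)                      # Greenwood standard error of R_n(t_j)

print(round(cbind(surv, se_green), 4))
```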




Example. Calculate the K-M estimate of S(t) for the following data,
where $\delta_i = 1$ if individual i died at time $t_i$ and $\delta_i = 0$ if individual i was
censored at that time, for i = 1,...,8.


  i   |  1    2    3    4    5    6    7    8
  t_i |  …    …    8    10   12   15   20   23
  δ_i |  1    1    1    0    0    1    1    0




The K-M estimates $R_n(t_i)$ of the survival function at different time points
$t_i$ are given in the following table:


  i   t_i   δ_i   n_i   1 − λ_i (estimated)   R_n(t_i)
  1   …     1     8     7/8                   7/8 = 0.875
  2   …     1     7     6/7                   (7/8)(6/7) = 0.750
  3   8     1     6     5/6                   (7/8)(6/7)(5/6) = 0.625
  4   10    0     5     1                     (7/8)(6/7)(5/6)(1) = 0.625
  5   12    0     4     1                     (7/8)(6/7)(5/6)(1)(1) = 0.625
  6   15    1     3     2/3                   (7/8)(6/7)(5/6)(1)(1)(2/3) = 0.417
  7   20    1     2     1/2                   (7/8)(6/7)(5/6)(1)(1)(2/3)(1/2) = 0.208
  8   23    0     1     1                     (7/8)(6/7)(5/6)(1)(1)(2/3)(1/2)(1) = 0.208
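The last column of the table is a running product, which can be checked in one line of R. The sketch uses only the at-risk counts and the death indicators from the table; the variable names are illustrative.

```r
n_i  <- 8:1                           # at risk just before each ordered time
d_i  <- c(1, 1, 1, 0, 0, 1, 1, 0)     # delta_i: 1 = death, 0 = censored
surv <- cumprod(1 - d_i / n_i)        # product of (1 - d_i / n_i) up to each time

print(round(surv, 3))
```

Censored observations contribute a factor of 1, so the estimate changes only at death times while the risk set still shrinks.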




Example. Consider the following failure time data measured in some
suitable unit :


6 4 4 10 5* 5 8 11 6 8 6* 8 4* 4 4 7* 6 10* 3 5
where * denotes that the reading is right censored, not an exact failure time.


Using an R program, find the estimates of the survival function at different
time points assuming (i) the failure time T ~ shifted Exponential
distribution with density $f(t) = \frac{1}{\sigma}\, e^{-\frac{(t - \alpha)}{\sigma}}$, $t > \alpha > 0$, and (ii) T ~ F,
where F is absolutely continuous.




Solution. The given data are Type I censored data.
(i) Assuming the shifted exponential failure distribution, we get from Module
15 that the ML estimate of $(\alpha, \sigma)$ is

$$(\hat\alpha, \hat\sigma) = \left(y_{(1)},\ \frac{\sum_{i=1}^{n} \left(y_i - y_{(1)}\right)}{r}\right),$$

where the $y_i$'s are the potential responses, $y_{(1)} = \min(y_1, \ldots, y_n)$ and r = # uncensored cases.


Also the survival function at time point t will be

$$S(t) = P[T > t] = e^{-\frac{(t - \alpha)}{\sigma}}.$$

Hence its (ML) estimate will be

$$\hat S_n(t) = e^{-\frac{(t - \hat\alpha)}{\hat\sigma}}.$$

Using an R program we will find these estimates.




Assumption (ii) leads to the nonparametric estimation of the survival
function under Type I censoring. Hence we will use the K-M estimator to
estimate the survival probability at each time point. The R program in
support of this computation and the findings are as follows.


R program for the estimation of survival probabilities : 


R Code and Output : 


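library(survival)
# observed times; del = 1 for an exact failure time, 0 for a right-censored reading
t=c(6,4,4,10,5,5,8,11,6,8,6,8,4,4,4,7,6,10,3,5)
# the printed output below has 13 failures with only one at t = 8, so two of the
# readings at t = 8 are coded as censored here; which two is an assumption
del=c(1,1,1,1,0,1,0,1,1,0,0,1,0,1,1,0,1,0,1,1)


df=data.frame(t,del)

# parametric: ML estimates under the shifted exponential model

alpha=min(t)
sigma=sum(t-alpha)/sum(del)
Sn1=exp(-(t-alpha)/sigma)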




R Code and Output (continued) : 


> Sn1
 [1] 0.5436906 0.8161762 0.8161762 0.2412612 0.6661436 0.6661436 0.3621760
 [8] 0.1969117 0.5436906 0.3621760 0.5436906 0.3621760 0.8161762 0.8161762
[15] 0.8161762 0.4437473 0.5436906 0.2412612 1.0000000 0.6661436
> # non-parametric
> fit=survfit(Surv(df$t,df$del)~1,df)




R Code and Output (continued) : 


> summary(fit)
Call: survfit(formula = Surv(df$t, df$del) ~ 1, data = df)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    3     20       1    0.950  0.0487       0.8591        1.000
    4     19       4    0.750  0.0968       0.5823        0.966
    5     14       2    0.643  0.1087       0.4616        0.895
    6     11       3    0.468  0.1170       0.2862        0.764
    8      6       1    0.390  0.1207       0.2123        0.715
   10      3       1    0.260  0.1331       0.0951        0.709
   11      1       1    0.000     NaN           NA           NA

> Sn2=fit$surv
> Sn2
[1] 0.9500000 0.7500000 0.6428571 0.4675325 0.4675325 0.3896104 0.2597403
[8] 0.0000000
> # K-M plot
> plot(fit,xlab="t",ylab=expression(S(t)),main="Kaplan-Meier Plot")


The Kaplan-Meier Plot corresponding to the problem : 


[Figure: Kaplan-Meier Plot; horizontal axis t, vertical axis S(t).]




Partial Likelihood and Cox Proportional Hazard Model 
Module 17 


Saurav De 


Department of Statistics 
Presidency University 




• Let T : lifetime of an item following a continuous probability
distribution with P(T > 0) = 1.


• $\bar F(t)$ or $S(t)$ or $P(T > t)$ : survival probability of the item at time
point t.


• S(t) : known as the survival function at time point t.


• $H(t) = -\log_e S(t)$ : cumulative hazard function at time point t.




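• $h(t) = \frac{d}{dt} H(t) = \frac{d}{dt}\left[-\log_e S(t)\right]$ : instantaneous failure/hazard rate at
time point t.


• h(t) : known as the hazard function at time point t.


Note : $h(t) = \frac{f(t)}{S(t)} = \frac{f(t)}{\bar F(t)}$, where f(t) : lifetime density at time
point t.
• $S(t) \downarrow t \Leftrightarrow \left[-\log_e S(t)\right] \uparrow t$. So, $h(t) = \frac{d}{dt}\left[-\log_e S(t)\right] \ge 0,\ \forall\, t$.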
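These relations are easy to confirm numerically. The R sketch below uses a Weibull lifetime purely as an example (the shape and scale values are arbitrary) and checks that the numerical derivative of $H(t) = -\log_e S(t)$ coincides with $f(t)/S(t)$.

```r
shape <- 1.5                       # arbitrary Weibull shape
scale <- 2                         # arbitrary Weibull scale

S <- function(t) pweibull(t, shape, scale, lower.tail = FALSE)   # survival function
f <- function(t) dweibull(t, shape, scale)                       # lifetime density
H <- function(t) -log(S(t))                                      # cumulative hazard

t0  <- c(0.5, 1, 2, 4)
eps <- 1e-6
h_numeric <- (H(t0 + eps) - H(t0 - eps)) / (2 * eps)   # central difference for dH/dt
h_ratio   <- f(t0) / S(t0)                             # hazard as f / S

print(cbind(t = t0, h_numeric, h_ratio))
```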




Survival Regression or Hazard Regression : 


• Survival or hazard of an item at time t usually depends on some
characteristics like gender, age, Body Mass Index (BMI), etc. Such
characteristics are known as covariates.

• Let x denote a covariate supposed to affect h(t), now expressed as
h(t, x), without loss of generality.


• Thus, survival or hazard regression can be looked upon as the
functional dependence of h(t, x) on the covariate x.




Purposes of hazard regression : 


• To explore a mathematical model like h(t, x) = g(x), where g(x) :
a suitable functional form.


• To predict an estimate of the hazard corresponding to some covariate
value.


• To measure the effect of covariate x on the hazard function, h(t, x).


• To test the significance of the covariate information x on h(t, x).




Cox's Proportional Hazard Model : 


$$h(t, x) = h_0(t, \alpha)\, e^{\beta' x}$$


where x denotes a vector of covariates;
$\alpha$ and $\beta$ denote the parameters of the Cox's regression model.




Some realisations on Cox's regression model : 


• $h_0(t, \alpha)$ : the value of h(t, x) at x = 0 (i.e., a baseline choice of x).
That is why $h_0(t, \alpha)$ is called the baseline hazard function at time point
t.

• $h_0(t, \alpha)$ depends on t, but not on covariates.

• The term $e^{\beta' x}$ depends on the covariates x, but not on time point t.

• $\beta$ measures the effects of x on h(t, x). How ?
For a univariate covariate, denoted by x, $\frac{h(t, x+1)}{h(t, x)} = e^{\beta}$ gauges the impact
on h(t, x) of a unit change in the value of x.




Some realisations on Cox's regression model (continued) : 


• Consider two individuals with covariates $x_1$ and $x_2$. Then, the ratio of
their hazards at time t is given by :

$$\frac{h(t, x_1)}{h(t, x_2)} = \frac{h_0(t, \alpha)\, e^{\beta' x_1}}{h_0(t, \alpha)\, e^{\beta' x_2}} = e^{\beta' (x_1 - x_2)}$$

which is constant with respect to time t




$\Rightarrow h(t, x_1) \propto h(t, x_2)$ with respect to time t.


• In other words, the hazards are proportional $\forall\, t$. Hence the name
Proportional Hazard Model (P.H.M.).


e.g. the hazard ratio of two individuals will remain the same whether they
are compared at age 8 years or at age 80 years. This seems a little unrealistic at times.
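A quick numerical check of the proportionality is sketched below in R. The baseline hazard `h0`, the coefficient `beta` and the covariate values are all hypothetical; any other baseline would give the same constant ratio.

```r
h0   <- function(t) 0.1 * t                     # hypothetical baseline hazard h_0(t)
beta <- 0.5                                     # hypothetical regression coefficient
h    <- function(t, x) h0(t) * exp(beta * x)    # Cox model hazard h(t, x)

t_grid <- c(1, 8, 80)                           # three very different time points
ratio  <- h(t_grid, x = 2) / h(t_grid, x = 0)   # hazard ratio for x1 = 2 versus x2 = 0

print(ratio)                                    # the same value exp(beta * (2 - 0)) at every t
```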




Some realisations on Cox's regression model (continued) : 


• In Cox's P.H.M., one is interested in the parameters $\beta$, which
measure the effects of covariates on survival, but usually not in $\alpha$.
$\Rightarrow$ $\beta$ : parameters of interest, $\alpha$ : nuisance parameters.

• Hence, no form is pre-specified for the baseline hazard $h_0(t, \alpha)$. The
Cox's P.H.M. is thus called a semi-parametric model.




Estimating the covariate parameters $\beta$ : Partial Likelihood


• Since the form of $h_0(t, \alpha)$ is not specified here, the form of the
likelihood function is not fully known $\Rightarrow$ the usual method of
maximum likelihood fails.

• For estimating the regression parameters $\beta$, Cox developed a
non-parametric method and called it Partial Likelihood.




• Let m = number of individuals under study.

$\delta_i = 1$, if the ith individual is uncensored,

$\delta_i = 0$, if the ith individual is right-censored ($i = 1, \ldots, m$).

Define $R(t_i)$ = set of all individuals surviving or functioning at time $t_i$
= risk set at $t_i$.

Let $h_j(t)$ denote the hazard of the jth individual at time t.

• Then, given that $t_i$ is an event time (i.e., failure/death time), the
probability that individual i has that event is given by :

$$P([i] \mid t_i) = \frac{h_i(t_i)}{\sum_{j \in R(t_i)} h_j(t_i)}, \qquad [i] : \text{individual } i.$$




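• As per the Cox's P.H.M. assumptions, $h_j(t, x_j) = h_0(t, \alpha)\, e^{\beta' x_j}$

$$\Rightarrow P([i] \mid t_i) = \frac{h_0(t_i, \alpha)\, e^{\beta' x_i}}{\sum_{j \in R(t_i)} h_0(t_i, \alpha)\, e^{\beta' x_j}} = \frac{e^{\beta' x_i}}{\sum_{j \in R(t_i)} e^{\beta' x_j}} = \frac{\phi_i}{\sum_{j \in R(t_i)} \phi_j} \text{ (say)},$$

where $\phi_j = e^{\beta' x_j}$.


• $P([i] \mid t_i)$ may be called the Risk Probability of individual i at time
point $t_i$.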




Partial Likelihood for unique failure times : 


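• Suppose that at each event time only one individual experiences the event.
Also, the individuals experience the events independently. $\Rightarrow$ The partial
likelihood for $\beta$ is given by :

$$L_p(\beta) = \prod_{i=1}^{m} \left[P([i] \mid t_i)\right]^{\delta_i} = \prod_{i=1}^{m} \left[\frac{\phi_i}{\sum_{j \in R(t_i)} \phi_j}\right]^{\delta_i}$$

• The power $\delta_i$ means that only the individuals having the event (failure/death)
actually contribute to the likelihood, not the right-censored cases.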
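The product above can be coded directly for a single covariate. The following R sketch is a minimal illustration, assuming no tied event times; `log_partial_lik` is a hypothetical helper name and the four observations are made up.

```r
log_partial_lik <- function(beta, time, delta, x) {
  phi <- exp(beta * x)                       # phi_j = exp(beta * x_j)
  ll  <- 0
  for (i in which(delta == 1)) {             # only uncensored individuals contribute
    risk <- time >= time[i]                  # risk set R(t_i)
    ll   <- ll + log(phi[i] / sum(phi[risk]))
  }
  ll
}

time  <- c(2, 4, 5, 7)        # hypothetical follow-up times
delta <- c(1, 0, 1, 1)        # 1 = event, 0 = right censored
x     <- c(1, 0, 2, 1)        # hypothetical covariate values

# At beta = 0 every phi_j equals 1, so the contributions are 1/4, 1/2 and 1
print(exp(log_partial_lik(0, time, delta, x)))
print(log_partial_lik(0.3, time, delta, x))
```

The censored individual (time 4) never appears in a numerator, but it is counted in the risk set of the first event.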




Why the name partial ? Two reasons for that are : 


(1) $L_p$ is not the full likelihood function of both the parameters $\alpha$ and $\beta$,
but of $\beta$ only.

(2) $L_p$ does not use the full data: it is not the actual times of occurrence
that matter, but their rankings; e.g., the likelihood remains the
same whether the individuals i, j and k have event times {1,2,3}
or {10,80,94}.

Merit of partial likelihood method : The method involves fewer
assumptions and hence is more robust than the full likelihood method.
Demerit of partial likelihood method : The method is less powerful
compared to a fully parametric model.
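The second reason can be seen numerically. In the R sketch below, three uncensored individuals keep the same covariate values while their event times are changed from {1, 2, 3} to {10, 80, 94}; the covariate values and the value of $\beta$ are hypothetical.

```r
# Log partial likelihood for untied, uncensored data with one covariate
log_pl <- function(beta, time, x) {
  sum(sapply(seq_along(time), function(i) {
    beta * x[i] - log(sum(exp(beta * x[time >= time[i]])))   # risk set: time >= t_i
  }))
}

x <- c(0.3, 1.2, -0.5)                        # hypothetical covariates of individuals i, j, k
a <- log_pl(0.7, time = c(1, 2, 3),    x = x)
b <- log_pl(0.7, time = c(10, 80, 94), x = x)

print(c(a, b))                                # equal: only the ordering of the times is used
```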




Note : Partial likelihood acts exactly in a similar way as the full likelihood. 


Let $\hat\beta_p$ denote the maximum partial likelihood estimate of $\beta$.


$$\Rightarrow L_p(\hat\beta_p) = \sup_{\beta} \prod_{i=1}^{m} \left[\frac{e^{\beta' x_i}}{\sum_{j \in R(t_i)} e^{\beta' x_j}}\right]^{\delta_i}$$




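Properties of $\hat\beta_p$ : Maximum Partial Likelihood estimate of $\beta$


(1) $\hat\beta_p \xrightarrow{P} \beta$ as $m \to \infty$.


(2) $\mathrm{Disp}(\hat\beta_p) \approx I_p^{-1}$, where $I_p$ is calculated from $L_p$ in exactly the same
way as the usual information matrix from the full likelihood.


(3) $\hat\beta_p \overset{a}{\sim} N(\beta, I_p^{-1})$ as $m \to \infty$.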




Partial Likelihood for repeated failure times : 


The case when two or more individuals may have same event times. 

Let $t_{(i)}$ be the ith distinct ordered event time, i.e., $t_{(i)} < t_{(i+1)}$ (that means,
ordered), and four failure times 1, 1, 3, 3 $\Rightarrow$ $t_{(1)} = 1$ and $t_{(2)} = 3$ (that
means, distinct).


• Let $I$ = number of unique event times, and


• $D(t)$ = the set of individuals having the event at time point t.




• Under repeated failure times, there are three popular methods of
framing the likelihood, which are :
(1) Breslow's method.
(2) Efron's method.
(3) Exact method.
• Consider the following situation first.
Suppose individuals labelled 1-5 are at risk of failing at $t_{(i)}$ $\Rightarrow$ $R(t_{(i)}) = \{1, 2, 3, 4, 5\}$.
• Let individuals 1-3 actually fail at $t_{(i)}$ $\Rightarrow$ $D(t_{(i)}) = \{1, 2, 3\}$.




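Then, under Breslow's method, the contribution from time $t_{(i)}$ to $L_p$ is :


$$\frac{\phi_1 \times \phi_2 \times \phi_3}{(\phi_1 + \cdots + \phi_5)\,(\phi_1 + \cdots + \phi_5)\,(\phi_1 + \cdots + \phi_5)}$$


$\Rightarrow$ formally, we have from Breslow's method that :


$$L_p(\beta, x) = \prod_{i=1}^{I} \frac{\prod_{j \in D(t_{(i)})} \phi_j}{\left[\sum_{j \in R(t_{(i)})} \phi_j\right]^{|D(t_{(i)})|}}$$


where $|D(t_{(i)})|$ = number of individuals in the set $D(t_{(i)})$.


It assumes that all the 3 individuals fail simultaneously at $t_{(i)}$, a crude
and unrealistic assumption.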




Under Efron's method, the contribution to $L_p$ in the illustration is :

$$\frac{\phi_1\, \phi_2\, \phi_3}{(\phi_1 + \cdots + \phi_5)\left(\phi_1 + \cdots + \phi_5 - \frac{1}{3}(\phi_1 + \phi_2 + \phi_3)\right)\left(\phi_1 + \cdots + \phi_5 - \frac{2}{3}(\phi_1 + \phi_2 + \phi_3)\right)}$$


It assumes that the 3 individuals do not fail simultaneously, and the
denominators are adjusted accordingly.


As it is not known a priori who is the first to fail, one-third of
$(\phi_1 + \phi_2 + \phi_3)$ is subtracted from $\sum_{j=1}^{5} \phi_j$ after one fails. Similarly, two-thirds
of $(\phi_1 + \phi_2 + \phi_3)$ is subtracted after the first two individuals fail.




So, in general, under Efron's method : 


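$$L_p(\beta, x) = \prod_{i=1}^{I} \frac{\prod_{j \in D(t_{(i)})} \phi_j}{\prod_{k=1}^{|D(t_{(i)})|} \left[\sum_{j \in R(t_{(i)})} \phi_j - \frac{k-1}{|D(t_{(i)})|} \sum_{j \in D(t_{(i)})} \phi_j\right]}$$


This method is much more accurate than Breslow's method. It is also the default
method for handling tied event times when fitting the Cox PHM with coxph in R.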


Using the illustration, finally the Exact method contributes :

$$\frac{\phi_1 \phi_2 \phi_3}{\phi_1 \phi_2 \phi_3 + \phi_1 \phi_2 \phi_4 + \phi_1 \phi_2 \phi_5 + \cdots + \phi_3 \phi_4 \phi_5}$$

$\Rightarrow$ in general, the Exact method defines :

$$L_p(\beta, x) = \prod_{i=1}^{I} \frac{\prod_{j \in D(t_{(i)})} \phi_j}{\sum_{q \in Q_i} \phi_q}$$

where $Q_i$ = set of all $|D(t_{(i)})|$-tuples that could be selected from $R(t_{(i)})$
and $\phi_q$ = the product of the $\phi_j$'s over all members j of a $|D(t_{(i)})|$-tuple q.


Note :
• Efron's method is closer to the Exact method.


• In the untied case, these three methods coincide with $L_p$ for
unique failure times.
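The three contributions for a single tied event time can be compared side by side. The R sketch below is an illustration only: `tie_contributions` is a hypothetical helper, the failing individuals are assumed to be listed first in the risk set, and the $\phi_j$ values are made up.

```r
tie_contributions <- function(phi_risk, d) {
  # phi_risk: phi_j over the risk set; the first d entries are the individuals that fail
  phi_D <- phi_risk[seq_len(d)]
  total <- sum(phi_risk)
  num   <- prod(phi_D)

  breslow <- num / total^d                                   # same denominator d times
  k       <- seq_len(d)
  efron   <- num / prod(total - (k - 1) / d * sum(phi_D))    # denominators shrink stepwise
  subsets <- combn(length(phi_risk), d)                      # all d-subsets of the risk set
  exact   <- num / sum(apply(subsets, 2, function(q) prod(phi_risk[q])))

  c(breslow = breslow, efron = efron, exact = exact)
}

phi <- c(1.2, 0.8, 1.5, 1.0, 0.6)      # hypothetical phi_1, ..., phi_5
print(tie_contributions(phi, d = 3))   # individuals 1-3 fail, 4-5 survive
print(tie_contributions(phi, d = 1))   # a single failure: all three coincide
```

With a single failure the three expressions reduce to the same ratio, in line with the note on the untied case.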




Partial likelihood estimate by numerical method: 


The above three forms of the partial likelihood function are quite
difficult to maximise analytically when fitting the (Cox PH) model
parameters.

Let $\beta$ be the k-dimensional model parameter vector; our objective is to
get $\hat\beta$ that maximises the log partial likelihood function $\ell_p(\beta)$.
The Newton-Raphson method, as discussed below, can be used to get the
estimate of the parameters :


• Choose an arbitrary value $\beta^{(0)}$ as an initial approximation to $\hat\beta$.




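• Define the score vector

$$u(\beta) = \left(\frac{\partial \ell_p(\beta)}{\partial \beta_1}, \ldots, \frac{\partial \ell_p(\beta)}{\partial \beta_k}\right)'$$

and the k × k information matrix $I(\beta)$ whose (i, j)th element is
$-\dfrac{\partial^2 \ell_p(\beta)}{\partial \beta_i\, \partial \beta_j}$.


• Get the first approximation $\beta^{(1)} = \beta^{(0)} + I^{-1}(\beta^{(0)})\, u(\beta^{(0)})$,
the second approximation $\beta^{(2)} = \beta^{(1)} + I^{-1}(\beta^{(1)})\, u(\beta^{(1)})$, and so on.
• The iterative method is taken to have converged at the (r + 1)th step if $\beta^{(r)}$ and $\beta^{(r+1)}$
agree up to a chosen number of decimal places, and then the ML estimate is $\hat\beta = \beta^{(r)}$ or
$\beta^{(r+1)}$.


• The further $\beta^{(0)}$ is from $\hat\beta$, the larger r is, i.e. the less likely is the convergence to $\hat\beta$.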




An Illustration. Let $x_i$ : the body mass index (BMI) of individual i.
Suppose a study is conducted on individuals who suffered a heart
attack. The following 9 individuals died, and $t_i$ : the day(s) to death for
individual i following the attack. The data are:


  i   |  1     2     3     4     5     6     7     8     9
  t_i |  6     98    189   374   1002  1205  2065  2201  2421
  x_i |  31.4  21.5  27.1  22.7  35.7  30.7  26.5  28.3  27.9




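Here Cox PHM : $h(t_i, x_i) = h_0(t_i, \alpha)\, e^{\beta x_i}$
The partial loglikelihood will be


$$\ell_p(\beta) = \beta \sum_{i=1}^{9} x_i - \sum_{i=1}^{9} \log\left[\sum_{j \in R(t_i)} e^{\beta x_j}\right]$$

$$\Rightarrow \ell_p'(\beta) = \sum_{i=1}^{9} x_i - \sum_{i=1}^{9} \frac{\sum_{j \in R(t_i)} x_j\, e^{\beta x_j}}{\sum_{j \in R(t_i)} e^{\beta x_j}} \qquad \ldots (*)$$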


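Hence


$$\ell_p''(\beta) = -\sum_{i=1}^{9} \frac{A_i}{\left[\sum_{j \in R(t_i)} e^{\beta x_j}\right]^2} \qquad \ldots (**)$$

where $A_i = \left[\sum_{j \in R(t_i)} x_j^2\, e^{\beta x_j}\right] \left[\sum_{j \in R(t_i)} e^{\beta x_j}\right] - \left[\sum_{j \in R(t_i)} x_j\, e^{\beta x_j}\right]^2$.


Obviously here $U(\beta) = \ell_p'(\beta)$ and $I(\beta) = -\ell_p''(\beta)$.


$\Rightarrow$ the Newton-Raphson recursive update relation is


$$\beta^{(k+1)} = \beta^{(k)} + I^{-1}(\beta^{(k)})\, U(\beta^{(k)})$$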




Let us choose $\beta^{(0)} = 0$.


Hence from (*) and (**) we compute U(0) = −2.512 and I(0) = 77.13.
So


$\beta^{(1)} = 0 - 2.512/77.13 = -0.0326$


This way we get (on computation)
$\beta^{(2)} = -0.0326 - 0.069/72.83 = -0.0335$,


$\beta^{(3)} = -0.0335 - 0.000061/72.70 \approx -0.0335$


As $\beta^{(2)}$ and $\beta^{(3)}$ agree up to 4 decimal places, the iteration has converged,
correct up to 4 decimal places.


Hence, the maximum partial likelihood estimate is $\hat\beta = -0.0335$ (correct up to 4 decimal places).
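The iteration can be reproduced in a few lines of R from the data of the illustration (9 deaths, no censoring, no ties). The sketch below is one way to code (*) and (**); `score_info` is a hypothetical helper name and the stopping tolerance is an arbitrary choice.

```r
t_days <- c(6, 98, 189, 374, 1002, 1205, 2065, 2201, 2421)       # days to death
bmi    <- c(31.4, 21.5, 27.1, 22.7, 35.7, 30.7, 26.5, 28.3, 27.9)

# Score U(beta) from (*) and information I(beta) from (**)
score_info <- function(beta, time, x) {
  U <- 0
  I <- 0
  for (i in seq_along(time)) {
    risk <- time >= time[i]                   # risk set R(t_i)
    w    <- exp(beta * x[risk])
    m1   <- sum(x[risk] * w) / sum(w)         # weighted mean of x over the risk set
    m2   <- sum(x[risk]^2 * w) / sum(w)       # weighted mean of x^2 over the risk set
    U    <- U + x[i] - m1
    I    <- I + m2 - m1^2                     # equals A_i / (sum of w)^2
  }
  c(U = U, I = I)
}

print(round(score_info(0, t_days, bmi), 3))   # starting values U(0) and I(0)

beta <- 0                                     # beta^(0)
for (step in 1:25) {
  si       <- score_info(beta, t_days, bmi)
  beta_new <- beta + si[["U"]] / si[["I"]]    # Newton-Raphson update
  done     <- abs(beta_new - beta) < 1e-8
  beta     <- beta_new
  if (done) break
}
print(round(beta, 4))
```

At $\beta = 0$ the score is the sum of each $x_i$ minus the mean of x over its risk set, and the information is the sum of the risk-set variances of x.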




Fitting Cox PHM with partial likelihood method using R :- 


Consider similar data, as follows:


  i   |  1     2     3     4     5     6     7     8     9     10
  t_i |  6     98    98    98    98    189   374   374   374   1002
  δ_i |  1     1     1     0     1     1     0     1     1     1
  x_i |  31.4  21.5  23.2  22.5  …     27.1  22.7  22.7  20.6  35.7
  z_i |  89    93    69    58    84    81    81    74    79    70



  i   |  11    12    13    14    15    16    17    18
  t_i |  1002  1205  1205  1205  2065  2201  2201  2421
  δ_i |  1     1     0     1     1     1     0     1
  x_i |  24.7  30.7  25.1  28.8  26.5  28.3  29.5  27.9
  z_i |  76    73    67    70    86    71    75    73


where $t_i$ : days to death or right censoring for the ith individual,
$\delta_i$ : censorship indicator for individual i; $\delta_i = 1$ if died, $\delta_i = 0$ if censored,
$(x_i, z_i)$ : (BMI, Age) for individual i.


Using R, fit the Cox PHM and study different inferential aspects of your fit
based on these repeated-event-time data.




Fitting through Breslow’s, Efron’s and Exact Method of partial 
likelihood using R: 


R Code and Output : 


> library(survival)
> t=c(6,98,98,98,98,189,374,374,374,1002,1002,1205,1205,1205,2065,2201,2201,2421)
> delta=c(1,1,1,0,1,1,0,1,1,1,1,1,0,1,1,1,0,1)
> # the fifth BMI value is not legible in the source and is left as …
> bmi=c(31.4,21.5,23.2,22.5,…,27.1,22.7,22.7,20.6,35.7,24.7,30.7,25.1,28.8,26.5,28.3,29.5,27.9)
> age=c(89,93,69,58,84,81,81,74,79,70,76,73,67,70,86,71,75,73)
> df=data.frame(t,delta,bmi,age)
> a=coxph(Surv(df$t, df$delta) ~ df$bmi + df$age,df,method = "breslow")




R Code and Output (continued) : 


> summary(a)
Call:
coxph(formula = Surv(df$t, df$delta) ~ df$bmi + df$age, data = df,
    method = "breslow")

  n= 18, number of events= 14

           coef exp(coef) se(coef)      z Pr(>|z|)
df$bmi -0.01820   0.98197  0.10412 -0.175   0.8613
df$age  0.07958   1.08283  0.04461  1.784   0.0745

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1




R Code and Output (continued) : 


       exp(coef) exp(-coef) lower .95 upper .95
df$bmi     0.982     1.0184    0.8007     1.204
df$age     1.083     0.9235    0.9922     1.182

Concordance= 0.708  (se = 0.105 )
Rsquare= 0.196   (max possible= 0.963 )
Likelihood ratio test= 3.92  on 2 df,   p=0.1407
Wald test            = 4.1  on 2 df,   p=0.1284
Score (logrank) test = 4.31  on 2 df,   p=0.1158

> b=coxph(Surv(df$t, df$delta) ~ df$bmi + df$age,df,method = "efron")
> summary(b)
Call:
coxph(formula = Surv(df$t, df$delta) ~ df$bmi + df$age, data = df,
    method = "efron")

  n= 18, number of events= 14




R Code and Output (continued) : 


           coef exp(coef) se(coef)      z Pr(>|z|)
df$bmi -0.02690   0.97346  0.10567 -0.255   0.7991
df$age  0.08192   1.08537  0.04461  1.836   0.0663

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

       exp(coef) exp(-coef) lower .95 upper .95
df$bmi    0.9735     1.0273    0.7913     1.197
df$age    1.0854     0.9213    0.9945     1.185

Concordance= 0.692  (se = 0.105 )
Rsquare= 0.209   (max possible= 0.961 )
Likelihood ratio test= 4.22  on 2 df,   p=0.1211
Wald test            = 4.42  on 2 df,   p=0.11
Score (logrank) test = 4.66  on 2 df,   p=0.0975




R Code and Output (continued) : 


> d=coxph(Surv(df$t, df$delta) ~ df$bmi + df$age,df,method = "exact")
> summary(d)
Call:
coxph(formula = Surv(df$t, df$delta) ~ df$bmi + df$age, data = df,
    method = "exact")

  n= 18, number of events= 14

           coef exp(coef) se(coef)      z Pr(>|z|)
df$bmi -0.02102   0.97920  0.10851 -0.194   0.8464
df$age  0.09451   1.09912  0.05071  1.864   0.0624

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1




R Code and Output (continued) : 


exp(coef) exp(-coef) lower .95 upper .95
df$bmi 0.9792 1.0212 0.7916 1.211
df$age 1.0991 0.9098 0.9951 1.214

Rsquare= 0.221 (max possible= 0.94 )
Likelihood ratio test= 4.5 on 2 df, p=0.1052
Wald test = 4.3 on 2 df, p=0.1162
Score (logrank) test = 4.75 on 2 df, p=0.09296




Conditional and Marginal Likelihood 
Module 18 


Saurav De 


Department of Statistics 
Presidency University 




• The multiparameter case (i.e., number of parameters ≥ 2) is very common in
probability distributions, and is the most realistic one too.

• In such cases the full likelihood is a function of all those parameters.

• But only a few of those parameters are truly of interest for the
inferential study. The rest are regarded as nuisance parameters, so we wish
to eliminate them while estimating the parameters of interest.




Conditional and marginal likelihoods are two popular alternative likelihood
approaches in which :

• nuisance parameters can be eliminated;

• the likelihood function can be framed solely as a function of the
parameters of interest.

These approaches are useful because the conventional likelihood (also known
as the full likelihood) can become unrealistic or fail completely when the
setup is overburdened with many or high-dimensional nuisance parameters.




Conditional Likelihood Method : 


Let $f(x; \theta, \lambda)$ be the joint density of $k\ (\geq 2)$ random variables
$(x_1, \ldots, x_k) = x$.

Let $\theta$ be the real or vector valued parameter of interest and $\lambda$ be the
nuisance parameter (or parameter vector).

If possible, suppose the joint density can be expressed as :
$$f(x; \theta, \lambda) = f(x_1, \ldots, x_r \mid x_{r+1}, \ldots, x_k; \theta)\; f(x_{r+1}, \ldots, x_k; \theta, \lambda),$$

i.e., joint density = conditional density × marginal density,

where $(x_1, \ldots, x_r \mid x_{r+1}, \ldots, x_k)$ is so divided that the conditional
density becomes free from the nuisance parameter $\lambda$.




Based on a sample of size $n$ on $x$, we have :
$$\prod_{i=1}^{n} f(x_i; \theta, \lambda) = \prod_{i=1}^{n} f(x_{1i}, \ldots, x_{ri} \mid x_{r+1,i}, \ldots, x_{ki}; \theta)\; \prod_{i=1}^{n} f(x_{r+1,i}, \ldots, x_{ki}; \theta, \lambda).$$

The LHS is nothing but the conventional or full likelihood of all the
parameters $(\theta, \lambda)$ based on the full data, whereas the first product on the
RHS is called the conditional likelihood of $\theta$ given the data $X$ on
$x_{r+1}, \ldots, x_k$; we denote it by $L(\theta \mid X)$ for simplicity.

In the conditional likelihood method of estimation, we focus exclusively on
this $L(\theta \mid X)$ and nowhere else, though the second product on the RHS may
also carry information about $\theta$.




Thus the conditional likelihood method may incur some loss of information
about the parameter of interest $\theta$ whenever $\theta$ (or some components of $\theta$
in the vector-valued case) is common to the conditional and marginal densities.

Still, very often we base the inference about $\theta$ on its conditional
likelihood or log-likelihood; e.g., in econometric models like the CLRM
(classical linear regression model) the focus is typically on the parameters
$\theta = (\beta, \sigma^2)$ of the conditional density.




Note that in this method we typically choose not to specify the form of
the marginal density and only maximise the conditional likelihood to
obtain the conditional maximum likelihood estimators of $\theta$.


Conditional Maximum Likelihood Estimator :


In the case considered above, the conditional log-likelihood is :
$$\log L(\theta \mid X) = \sum_{i=1}^{n} \log f(x_{1i}, \ldots, x_{ri} \mid x_{r+1,i}, \ldots, x_{ki}; \theta).$$




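Without loss of generality assume that $\theta$ is an $s$-component vector, $s \geq 1$.
Then the conditional MLE is that value of $\theta$ which maximises $\log L(\theta \mid X)$
or, in most cases, solves the system of first order conditions (f.o.c.)
$$\frac{\partial \log L(\theta \mid X)}{\partial \theta} = 0 \iff \frac{\partial \log L(\theta \mid X)}{\partial \theta_1} = \cdots = \frac{\partial \log L(\theta \mid X)}{\partial \theta_s} = 0,$$
i.e., the score vector is null, where $\frac{\partial \log L(\theta \mid X)}{\partial \theta}$ denotes the score function.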




The standard regularity conditions :
Let the true conditional density be denoted by $f(x^1 \mid x^2; \theta_0)$ where
$x^1 = (x_1, \ldots, x_r)'$, $x^2 = (x_{r+1}, \ldots, x_k)'$ and $\theta_0$ is the true parameter vector.

Then the regularity conditions used for conditional MLE are
$$\text{R1} : \quad E_{\theta_0}\left[\frac{\partial \log f(x^1 \mid x^2; \theta_0)}{\partial \theta}\right] = 0,$$
or in other words, the expected value of the score vector is null,
expectation being taken with respect to the true conditional density.




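$$\text{R2} : \quad E_{\theta_0}\left[-\frac{\partial^2 \log f(x^1 \mid x^2; \theta_0)}{\partial \theta\, \partial \theta'}\right] = E_{\theta_0}\left[\frac{\partial \log f(x^1 \mid x^2; \theta_0)}{\partial \theta} \cdot \frac{\partial \log f(x^1 \mid x^2; \theta_0)}{\partial \theta'}\right].$$

Now R2 implies :
$$E_{\theta_0}\left[-\frac{\partial^2 \log L(\theta_0 \mid X)}{\partial \theta\, \partial \theta'}\right] = E_{\theta_0}\left[\frac{\partial \log L(\theta_0 \mid X)}{\partial \theta} \cdot \frac{\partial \log L(\theta_0 \mid X)}{\partial \theta'}\right],$$
which are nothing but the two forms of the $(s \times s)$ conditional Fisher's
Information matrix.

In fact, the true variance matrix of the score vector is :
$$V_{\theta_0}\left[\frac{\partial \log L(\theta_0 \mid X)}{\partial \theta}\right] = E_{\theta_0}\left[\frac{\partial \log L(\theta_0 \mid X)}{\partial \theta} \cdot \frac{\partial \log L(\theta_0 \mid X)}{\partial \theta'}\right],$$
i.e., the $(s \times s)$ conditional Information Matrix.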




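Now, we consider the asymptotic properties of the conditional MLE, assuming :

1. The true conditional density $f(x^1 \mid x^2; \theta)$ is used to define the
conditional likelihood function.

2. $f(x^1 \mid x^2; \theta_1) = f(x^1 \mid x^2; \theta_2)$ if and only if $\theta_1 = \theta_2$ (so that
$L(\theta \mid X)$ has a unique maximum).

3. $\frac{1}{n} \frac{\partial^2 \log L(\theta_0 \mid X)}{\partial \theta\, \partial \theta'} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial^2 \log L_i(\theta_0 \mid x_i^2)}{\partial \theta\, \partial \theta'} \xrightarrow{P} A_0$,
where $L_i(\theta_0 \mid x_i^2) = f(x_i^1 \mid x_i^2; \theta_0)$ and $A_0$ is an $(s \times s)$ non-singular
matrix.

4. Regularity conditions R1 and R2 hold.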




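We have $\hat\theta_{CML} \xrightarrow{P} \theta_0$ (weak consistency), $\hat\theta_{CML}$ denoting the conditional MLE, and
$$\sqrt{n}\,(\hat\theta_{CML} - \theta_0) \xrightarrow{d} N_s(0, -A_0^{-1}) \quad \text{(asymptotic normality)}.$$
Now we know from assumption 3 that $A_0 = E_{\theta_0}\left[\frac{\partial^2 \log L_i(\theta_0 \mid x_i^2)}{\partial \theta\, \partial \theta'}\right]$
$\forall\ i = 1, \ldots, n$ (which is evident from Khinchine's WLLN).

So, $E_{\theta_0}\left[\frac{1}{n} \frac{\partial^2 \log L(\theta_0 \mid X)}{\partial \theta\, \partial \theta'}\right] = A_0$ or $E_{\theta_0}\left[\frac{\partial^2 \log L(\theta_0 \mid X)}{\partial \theta\, \partial \theta'}\right] = n A_0$.

So the asymptotic distribution can also be expressed as :
$$(\hat\theta_{CML} - \theta_0) \overset{a}{\sim} N_s\left(0, -(nA_0)^{-1}\right) = N_s\left(0, \left(E_{\theta_0}\left[-\frac{\partial^2 \log L(\theta_0 \mid X)}{\partial \theta\, \partial \theta'}\right]\right)^{-1}\right),$$
where $\left(E_{\theta_0}\left[-\frac{\partial^2 \log L(\theta_0 \mid X)}{\partial \theta\, \partial \theta'}\right]\right)^{-1}$ is the inverse of the conditional Information
Matrix.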




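To ensure that the conditional density will be a function of $\theta$ only, we can
use the following mechanism.

For simplicity, we take $k = 1$, i.e., the univariate case. Consider a random
sample $X_1, \ldots, X_n$ of size $n$ on univariate $X$ => the joint density is now
$f(x; \theta, \lambda) = \prod_{i=1}^{n} f(x_i; \theta, \lambda)$.

Step 1 : Get the complete sufficient statistic $T_\lambda$ of the nuisance parameter
$\lambda$, $T_\lambda = T(X_1, \ldots, X_n)$ => the joint pdf $f(x; \theta, \lambda)$ can be expressed as :
$$f(x; \theta, \lambda) = f(x; \theta \mid T_\lambda = t) \cdot f_{T_\lambda}(t; \theta, \lambda) \quad \ldots (*)$$
because
$$f(x; \theta \mid T_\lambda = t) = \frac{f(x; \theta, \lambda)}{f_{T_\lambda}(t; \theta, \lambda)}.$$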



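Note : $T_\lambda$ is sufficient for $\lambda$ => $f(x; \theta \mid T_\lambda)$ is independent of $\lambda$.

Step 2 : Given the data $\{x_1, \ldots, x_n\}$ on the random sample $\{X_1, \ldots, X_n\}$,
from (*) we get :
$$L(\theta, \lambda \mid x) = L(\theta \mid T_\lambda = t) \cdot L^*(\theta, \lambda; t),$$
where LHS = full likelihood of $(\theta, \lambda)$
and $L(\theta \mid T_\lambda = t) = f(x; \theta \mid T_\lambda = t)$ = conditional likelihood of $\theta$, given
$T_\lambda = t$.

$\ell(\theta \mid T_\lambda = t) = \log L(\theta \mid T_\lambda = t)$ = conditional log-likelihood of $\theta$.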




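In this context, two cases may occur.
• Case 1 : For fixed $\theta_0$, $T_\lambda(\theta_0)$ depends on $\theta_0$.
• Case 2 : $T_\lambda(\theta_0) = T_\lambda$, i.e., independent of $\theta_0$, $\forall\ \theta_0$.

The following illustrations are useful to describe these two cases :

Illustration 1 : Let $X_1, \ldots, X_n \sim$ i.i.d. $N(\mu, \sigma^2)$.
Let $\sigma^2$ = the parameter of interest => $\mu$ = the nuisance parameter.

Now, the joint pdf is :
$$f(x; \sigma^2, \mu) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} e^{-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2}.$$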



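If $T_1 = \sum_{i=1}^{n} X_i^2$ and $T_\mu = \sum_{i=1}^{n} X_i$, then we know that $(T_\mu, T_1)$ is
complete sufficient for $(\mu, \sigma^2)$ => given $\sigma^2 = \sigma_0^2$, $T_\mu = \sum_{i=1}^{n} X_i$ is
sufficient for $\mu$.

Also, $T_\mu$ does not depend on $\sigma_0^2$, i.e., it is independent of $\sigma_0^2$, $\forall\ \sigma_0^2$ => the
present illustration falls under Case 2.
Now,
$$f(x; \sigma^2 \mid T_\mu = t) = \frac{f(x; \sigma^2, \mu)}{f_{T_\mu}(t)},$$
where $t = T_\mu(x) = \sum_{i=1}^{n} x_i = n\bar{x}$.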




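It can be shown that on simplification we finally get :
$$f(x; \sigma^2 \mid T_\mu = t) = C(\sigma^2)\, e^{-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \bar{x})^2} = L(\sigma^2 \mid T_\mu = t),$$
the conditional likelihood of $\sigma^2$ given $T_\mu = t = n\bar{x}$,
or, $\ell(\sigma^2 \mid T_\mu = t)$ (i.e., the conditional log-likelihood)
$$= \log_e C(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \bar{x})^2 \quad \text{(dependent on } \sigma^2 \text{ only)}$$
=> we can conduct statistical inference for $\sigma^2$ based on the above
conditional log-likelihood.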
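As a hedged numerical companion to Illustration 1: since $T_\mu = \sum X_i \sim N(n\mu, n\sigma^2)$, dividing the joint density by $f_{T_\mu}(t)$ gives $C(\sigma^2) = \sqrt{n}\,(2\pi\sigma^2)^{-(n-1)/2}$ (this constant is worked out here and is not stated on the slide). Maximising the conditional log-likelihood then gives $\sum (x_i - \bar{x})^2/(n-1)$, whereas the full-likelihood MLE uses divisor $n$. The sketch below (Python, made-up data, illustrative function names) checks this.

```python
import math
import statistics

def cond_loglik(sigma2, xs):
    """Conditional log-likelihood of sigma^2 given sum(xs), X_i i.i.d. N(mu, sigma^2)."""
    n = len(xs)
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    # log C(sigma^2) with C(sigma^2) = sqrt(n) * (2 pi sigma^2)^(-(n-1)/2)
    log_c = 0.5 * math.log(n) - 0.5 * (n - 1) * math.log(2 * math.pi * sigma2)
    return log_c - ss / (2 * sigma2)

def cond_mle_sigma2(xs):
    """Closed-form maximiser: sum of squares about the mean divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

xs = [4.1, 5.3, 3.8, 6.0, 5.1]  # made-up data
s2_cond = cond_mle_sigma2(xs)
xbar = sum(xs) / len(xs)
s2_full = sum((x - xbar) ** 2 for x in xs) / len(xs)  # full-likelihood MLE (divisor n)
print(s2_cond, s2_full)
```

In this example the conditional MLE coincides with the usual unbiased sample variance, so conditioning on $T_\mu$ removes the downward bias of the full-likelihood MLE.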




Now consider the next illustration.
Illustration 2 : Let $X_1 \sim N(\mu_1, 1)$, $X_2 \sim N(\mu_2, 1)$ independently.

Let $\theta = \mu_2/\mu_1$ = the parameter of interest and $\lambda = \mu_2$ the nuisance parameter,
i.e., $\mu_1 = \lambda/\theta$.

Now, it can be derived that for fixed $\theta = \theta_0$, the sufficient statistic for $\lambda$ is
$T_\lambda(\theta_0) = X_1 + \theta_0 X_2$, which follows $N(\mu_1 + \theta_0 \mu_2,\ 1 + \theta_0^2)$.

Here, the sufficient statistic $T_\lambda$ and its distribution depend on $\theta_0$. So this
illustration falls under Case 1. It can be shown that the conditional
log-likelihood $\ell(\theta, \theta_0 \mid T_\lambda(\theta_0) = t)$ is independent of $\lambda$.

=> we can make statistical inference for $\theta$ based on this conditional
log-likelihood function, e.g., to find the conditional ML estimate, we can
solve the equation
$$\left.\frac{\partial \ell(\theta, \theta_0 \mid T_\lambda(\theta_0) = t)}{\partial \theta}\right|_{\theta_0 = \theta} = 0$$
=> on simplification, $\hat\theta = x_2 / x_1$.




General Approach :

• When $T_\lambda(\theta_0) = T_\lambda$ is independent of $\theta_0$, the conditional
log-likelihood (which now depends only on $\theta$) will be
$$\ell_{cond}(\theta) = \log f_\theta(x \mid T_\lambda) = \log f_{\theta,\lambda}(x) - \log f_{T_\lambda}(t),$$
where $f_{T_\lambda}(t)$ denotes the density of $T_\lambda$ at $t$.

Any $\theta = \hat\theta_{CML}$ that maximises $\ell_{cond}(\theta)$ is a conditional ML estimate
of $\theta$.

The asymptotic variance (or dispersion in the vector valued case) of $\hat\theta_{CML}$
will be the inverse of the conditional Fisher's Information
$$I_{cond}(\theta) = -E\left[\frac{\partial^2}{\partial \theta^2}\, \ell_{cond}(\theta)\right].$$
Note that both $\hat\theta_{CML}$ and $I_{cond}(\theta)$ in general differ from those derived
from the full likelihood.




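• When $T_\lambda(\theta_0)$ depends on $\theta_0$, the conditional log-likelihood (which
now depends on $\theta$, $\theta_0$, $\lambda$) will be
$$\ell_{cond}(\theta, \theta_0, \lambda) = \log f_\theta(x \mid T_\lambda(\theta_0)) = \log f_{\theta,\lambda}(x) - \log f_{T_\lambda(\theta_0)}(t),$$
where $f_{T_\lambda(\theta_0)}(t)$ denotes the density of $T_\lambda(\theta_0)$ at $t$.

Now $\hat\theta_{CML}$ is a solution of
$$\left[\frac{\partial}{\partial \theta}\, \ell_{cond}(\theta, \theta_0, \lambda)\right]_{\theta_0 = \theta,\ \lambda = \hat\lambda(\theta)} = 0.$$

The asymptotic variance (or dispersion in the vector valued case) of $\hat\theta_{CML}$
will be the inverse of the conditional Fisher's Information
$$I_{cond}(\theta) = -E\left[\frac{\partial^2}{\partial \theta^2}\, \ell_{cond}(\theta, \theta_0, \lambda)\right]_{\theta_0 = \theta,\ \lambda = \hat\lambda(\theta)}.$$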




As a brief outline of marginal likelihood (also known as integrated
likelihood), we can say that in the multiparameter context this likelihood
is obtained by marginalizing or averaging out some nuisance parameters from
the full likelihood model and retaining the parameter(s) of our interest.

In Bayesian statistics, this may be called the evidence or model evidence.
Keeping the previous notation, let the nuisance parameter $\lambda$ have
density $p(\lambda; \theta)$
$$\implies f(x; \theta) = \int f(x; \theta, \lambda)\, p(\lambda; \theta)\, d\lambda.$$

Based on a random sample $(x_1, \ldots, x_n) = x$, the marginal likelihood of $\theta$ will be :
$$L(\theta \mid x) = \int f(x; \theta, \lambda)\, p(\lambda; \theta)\, d\lambda.$$

Note : Marginal likelihoods are difficult to compute, and closed-form
solutions are available only for a small class of distributions; otherwise
numerical methods are applicable.
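To make the integral concrete, here is a small sketch under an assumed toy model that is not on the slides: a single observation with $x \mid \lambda \sim N(\lambda, 1)$ and $\lambda \sim N(\theta, 1)$, for which the marginal density is known in closed form, $x \sim N(\theta, 2)$. The code (Python, illustrative names) integrates the nuisance parameter out with the trapezoidal rule and compares the result with the closed form.

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def marginal_likelihood(theta, x, half_width=10.0, steps=4000):
    """Trapezoidal approximation of the integral of f(x; lam) p(lam; theta) over lam."""
    lo, hi = theta - half_width, theta + half_width
    h = (hi - lo) / steps
    total = 0.0
    for j in range(steps + 1):
        lam = lo + j * h
        weight = 0.5 if j in (0, steps) else 1.0
        total += weight * normal_pdf(x, lam, 1.0) * normal_pdf(lam, theta, 1.0)
    return total * h

x_obs, theta = 1.3, 0.4  # made-up values
numeric = marginal_likelihood(theta, x_obs)
closed_form = normal_pdf(x_obs, theta, 2.0)
print(numeric, closed_form)
```

When no closed form exists, the same numerical integration (or a Monte Carlo average over draws of $\lambda$) is the kind of numerical method the note refers to.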




Saurav De 


Department of Statistics 
Presidency University 


• For probability models having more than one unknown parameter, we often
concentrate on the inference of one parameter of interest or of a
real-valued function of the parameters.

• For illustration, let $\theta$ be the parameter of interest and $\delta$ be the
(vector of) other parameter(s).


Profile Likelihood and Associated Confidence Interval 


Let $L(\theta, \delta)$ be the likelihood function.

Then the profile likelihood function of $\theta$, denoted by $L_1(\theta)$, is
$L_1(\theta) = \max_\delta L(\theta, \delta)$ for each value of $\theta$, i.e., the profile likelihood
function of $\theta$ is obtained by maximising out the remaining parameter(s) $\delta$
for each fixed value of the parameter of interest.




It provides :

• a graphical summary of the most plausible values of $\theta$;
• a parameterization invariant procedure;
• corrected confidence intervals that are more accurate than Wald
Type confidence intervals.




Note : A Wald Type confidence interval (CI) may not perform satisfactorily
if either :
• the distribution of the parameter estimator is notably skewed, or
• the standard error is a poor estimate of the standard deviation of the
estimator.
Now, asymptotic standard errors under GLMs are obtained from the
information matrix.
=> Wald's CI may work poorly for small to moderate sample sizes.




On the other hand, confidence intervals based on the profile likelihood do
not need normality assumptions on the estimator, and perform better than
Wald Type CIs in small samples.

In any case, the asymptotic null distribution of the profile log-likelihood
ratio test statistic is chi-square.




In general, consider a statistical model with $p$-dimensional parameter $\theta$ and
log-likelihood $\ell(\theta)$. Let the parameter function $\psi = \psi(\theta)$ be the
parameter of interest for inferential work.

Let $\hat\theta$ denote the overall MLE of $\theta$ and let $\hat\psi = \psi(\hat\theta)$.

Further, let $\hat\theta(\psi)$ be the MLE of $\theta$ for fixed $\psi$.
=> profile log-likelihood of $\psi$ : $\ell_p(\psi) = \ell(\hat\theta(\psi))$, $\forall\ \psi$.




Interesting features of $\ell_p$ :

• it is invariant under one-to-one reparameterization of $\theta$, and
• it is defined for any log-likelihood $\ell(\theta)$; no special structure like
exponential family, location-scale model, etc. is needed for its
definition.
But computation of $\ell_p(\psi)$ requires a constrained maximization for each
fixed value of $\psi$ => direct computation of $\ell_p(\psi)$ is difficult.




Computation of the profile likelihood :

The curve $\hat\theta(\psi)$ is computed on a grid and then the log-likelihood $\ell(\theta)$ is
evaluated along that curve. In this context, the strategy for computation
of $\hat\theta(\psi)$ is as follows :

1. Start at $\hat\theta$.
2. Take a small step in the direction of the tangent.
3. Apply Newton-Raphson iterations to move back onto the curve and
hence find a point on that curve.
4. Repeat steps 2 and 3 so that one moves along the curve $\hat\theta(\psi)$ and
finds a set of points lying exactly on that curve.
5. Stopping rule : we stop once the profile log-likelihood has decreased
by a sufficiently large amount.
6. Next, we go back to $\hat\theta$ and move in the opposite direction.

Let $\lambda$ be the Lagrangian multiplier for maximising $\ell(\theta)$ subject to
$\psi(\theta) = \psi$.




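Stein's Least Favourable Family :

Define a function $g(\theta) = \ell(\theta) + \lambda(\psi(\theta) - \psi)$ of the $p$-dimensional parameter
$\theta$, where $\lambda$ is the Lagrangian multiplier for maximising $\ell(\theta)$ subject to the
constraint $\psi(\theta) = \psi$.

Let $\nabla g(\theta) = \left(\frac{\partial g(\theta)}{\partial \theta_1}, \ldots, \frac{\partial g(\theta)}{\partial \theta_p}\right)'$ = gradient of $g$.

Then $\hat\theta(\psi)$ is the solution to $\nabla g(\theta) = 0$ and $\lambda(\psi)$ is the solution for $\lambda$.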




=> at $\psi = \hat\psi$, $\lambda(\hat\psi) = 0$ and the derivative of $\hat\theta(\psi)$ is $\propto (\nabla^2 \ell(\theta))^{-1}\, \nabla\psi(\theta)$,
evaluated at $\hat\theta$.

Then the line through $\hat\theta$ having the said direction is known as Stein's
Least Favourable Family.

The family is used in place of $\hat\theta(\psi)$ for an approximation to the profile
log-likelihood.




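Some simple applications of profile likelihood :

Let $X_1, \ldots, X_n$ be i.i.d. with common pdf :
$$f_{\alpha,\theta}(x) = \frac{1}{\theta}\, e^{-\frac{x - \alpha}{\theta}}, \quad x \geq \alpha; \qquad = 0, \text{ otherwise.}$$
Here, $0 < \theta < \infty$, $-\infty < \alpha < \infty$.
=> for each fixed $\theta$, the profile log-likelihood is :
$$\ell_p(\theta) = \max_\alpha \ell(\alpha, \theta) = -n \log_e \theta - \sum_{i=1}^{n} \frac{x_i - x_{(1)}}{\theta}, \quad \forall\ \theta > 0,$$
where $\ell(\alpha, \theta)$ is the full log-likelihood of $(\alpha, \theta)$ and $x_{(1)}$ is the sample minimum.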




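Next,
$$\frac{d\ell_p(\theta)}{d\theta} = 0 \implies \hat\theta = \frac{1}{n} \sum_{i=1}^{n} (x_i - x_{(1)}).$$
Also,
$$\left.\frac{d^2 \ell_p(\theta)}{d\theta^2}\right|_{\theta = \hat\theta} = \frac{n}{\hat\theta^2} - \frac{2 \sum_{i=1}^{n} (x_i - x_{(1)})}{\hat\theta^3} = -\frac{n}{\hat\theta^2} < 0$$
=> $\hat\theta$ is the maximum profile likelihood estimate of $\theta$.
Note that here $\theta$ is the parameter of interest.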
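The closed form above is easy to check numerically. The following sketch (Python, made-up data, illustrative function names) evaluates $\ell_p(\theta) = -n \log \theta - \sum (x_i - x_{(1)})/\theta$ and confirms that the mean excess over the sample minimum beats nearby values of $\theta$.

```python
import math

def profile_loglik(theta, xs):
    """l_p(theta) = -n log(theta) - sum(x_i - x_(1)) / theta, with alpha set to min(xs)."""
    n = len(xs)
    x_min = min(xs)
    return -n * math.log(theta) - sum(x - x_min for x in xs) / theta

def profile_mle(xs):
    """Closed-form maximiser of l_p: the mean excess over the sample minimum."""
    x_min = min(xs)
    return sum(x - x_min for x in xs) / len(xs)

xs = [2.4, 3.1, 2.9, 5.6, 4.0, 2.7]  # made-up data
theta_hat = profile_mle(xs)
print(theta_hat, profile_loglik(theta_hat, xs))
```

For these data the excesses over $x_{(1)} = 2.4$ sum to 6.3, so $\hat\theta = 6.3/6 = 1.05$.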




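Similarly, for $X_1, \ldots, X_n \sim$ i.i.d. $N(\mu, \sigma^2)$, the profile log-likelihood of
$\sigma^2$ will be :
$$\ell_p(\sigma^2) = \max_\mu \ell(\mu, \sigma^2) = \max_\mu \left\{\text{const.} - \frac{n}{2} \log_e \sigma^2 - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2\right\} = \text{const.} - \frac{n}{2} \log_e \sigma^2 - \frac{n s^2}{2\sigma^2}, \quad \forall\ \sigma^2 > 0,$$
where $s^2$ = sample variance with divisor $n$.

Using the maxima-minima principle, we can show that the maximum profile
likelihood estimate of $\sigma^2$ is $s^2$, which corresponds to the usual MLE of $\sigma^2$.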




Profile Likelihood Confidence Interval :

The idea of a profile likelihood confidence interval (CI) is to invert a
likelihood ratio test to obtain a CI for the parameter of interest.

A simple approach for obtaining CIs from $\ell_p(\psi)$ is to form the set of
values $\psi$ such that $2[\ell_p(\hat\psi) - \ell_p(\psi)] \leq C$, i.e., $\{\psi : 2[\ell_p(\hat\psi) - \ell_p(\psi)] \leq C\}$.

For a $100(1 - \alpha)\%$ CI of $\psi$, $C$ is chosen equal to $\chi^2_{\alpha;1}$, the upper $100\alpha\%$
point of the $\chi^2$ distribution with 1 degree of freedom (d.f.). This is because
the asymptotic distribution of the LR-statistic is here $\chi^2_1$.




The above choice of $C$ produces a CI with an overall coverage
$1 - \alpha + O\left(\frac{1}{n}\right)$.

To achieve higher accuracy in each tail, we consider the signed square
root of the LR statistic, i.e., $R(\psi) = \text{sgn}(\hat\psi - \psi)\, \sqrt{2[\ell_p(\hat\psi) - \ell_p(\psi)]}$.

Then the set $\{\psi : |R(\psi)| \leq \sqrt{C}\}$ is identical to the set
$\{\psi : 2[\ell_p(\hat\psi) - \ell_p(\psi)] \leq C\}$. Obviously, for a $100(1 - \alpha)\%$ CI of $\psi$,
asymptotically $\sqrt{C}$ is chosen as $\tau_{\alpha/2}$, the upper $100\frac{\alpha}{2}\%$ point of the $N(0, 1)$
distribution.
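The inversion of the LR statistic can be sketched numerically. The example below (Python, illustrative names, made-up data) reuses the profile log-likelihood $\ell_p(\theta) = -n \log \theta - \sum (x_i - x_{(1)})/\theta$ from the exponential application and finds the two roots of $2[\ell_p(\hat\theta) - \ell_p(\theta)] = C$ by bisection. The cutoff 3.841459 is the upper 5% point of $\chi^2_1$; the asymptotic calibration is simply taken as given here.

```python
import math

CHI2_1_UPPER_05 = 3.841459  # upper 5% point of chi-square with 1 d.f.

def profile_loglik(theta, xs):
    x_min = min(xs)
    return -len(xs) * math.log(theta) - sum(x - x_min for x in xs) / theta

def lr_stat(theta, theta_hat, xs):
    """2 [l_p(theta_hat) - l_p(theta)], the profile likelihood ratio statistic."""
    return 2.0 * (profile_loglik(theta_hat, xs) - profile_loglik(theta, xs))

def bisect(func, a, b, tol=1e-10):
    """Root of func on [a, b]; assumes func(a) and func(b) differ in sign."""
    fa = func(a)
    while b - a > tol:
        mid = 0.5 * (a + b)
        fm = func(mid)
        if (fa < 0) == (fm < 0):
            a, fa = mid, fm
        else:
            b = mid
    return 0.5 * (a + b)

def profile_ci(xs, cutoff=CHI2_1_UPPER_05):
    x_min = min(xs)
    theta_hat = sum(x - x_min for x in xs) / len(xs)
    g = lambda th: lr_stat(th, theta_hat, xs) - cutoff
    lower = bisect(g, theta_hat * 1e-3, theta_hat)
    upper_end = theta_hat * 2
    while g(upper_end) < 0:  # widen the bracket until the cutoff is exceeded
        upper_end *= 2
    upper = bisect(g, theta_hat, upper_end)
    return theta_hat, lower, upper

xs = [2.4, 3.1, 2.9, 5.6, 4.0, 2.7]  # made-up data
theta_hat, ci_lower, ci_upper = profile_ci(xs)
print(theta_hat, ci_lower, ci_upper)
```

The resulting interval is not symmetric about $\hat\theta$, which is the behaviour that distinguishes it from a Wald Type interval when the likelihood is skewed.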




Note : With $R(\psi)$ also, the coverage error is $O\left(\frac{1}{n}\right)$. But the advantage
is that we can have CIs of higher accuracy in each tail by correcting $R(\psi)$
to make its distribution closer to $N(0, 1)$.

McCullagh (1984) and DiCiccio (1984) have shown that under a
one-parameter model, in some sense, CIs based on $R(\psi)$ are as good as
those obtained from approximate pivots

=> $R(\psi)$ might be looked upon as the 'jackknife' of parametric models,
as it is simple and can be applied automatically.




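Logit Model Example :

Let $y_i \sim \text{Bin}(n_i, \pi_i)$, $i = 1, \ldots, k$.
We fit a logit model :
$$\log_e\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 x_i \quad \ldots (*)$$
where $x_i$ = $i$-th value of a single covariate $x$.
=> the log-likelihood function is :
$$\log L(\beta_0, \beta_1) = \sum_{i=1}^{k} \left[y_i \log_e\left(\frac{\pi_i}{1 - \pi_i}\right) + n_i \log_e(1 - \pi_i) + \log_e \binom{n_i}{y_i}\right]$$
$$= \sum_{i=1}^{k} \left[\log_e \binom{n_i}{y_i} + y_i(\beta_0 + \beta_1 x_i) - n_i \log_e\left(1 + e^{\beta_0 + \beta_1 x_i}\right)\right].$$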




Naturally, $\beta_1$ is the parameter of interest and $\beta_0$ the nuisance parameter.

=> the profile log-likelihood of $\beta_1$ is :
$$\ell_p(\beta_1) = \max_{\beta_0} \log_e L(\beta_0, \beta_1), \quad \forall\ \beta_1.$$

Computing $\ell_p(\beta_1)$ by hand, and then maximising it to get the maximum
profile likelihood estimate of $\beta_1$, is very difficult and time consuming.

The R function confint(m), where m is the fitted logit model, directly
provides the profile likelihood CI of $\beta_1$.
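To show what such an interval involves, here is a from-scratch sketch in Python. It is not R's implementation; the data, search ranges and helper names are all assumptions made for illustration. For each fixed $\beta_1$ the nuisance $\beta_0$ is maximised out by solving the score equation $\sum (y_i - n_i \pi_i) = 0$, which is monotone in $\beta_0$, and the interval is the set where $2[\ell_p(\hat\beta_1) - \ell_p(\beta_1)] \leq 3.841459$.

```python
import math

CHI2_1_UPPER_05 = 3.841459  # upper 5% point of chi-square with 1 d.f.

# Hypothetical grouped data: covariate x_i, trials n_i, successes y_i.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ns = [20, 20, 20, 20, 20]
ys = [3, 6, 9, 14, 17]

def log1pexp(eta):
    """Numerically stable log(1 + exp(eta))."""
    return eta + math.log1p(math.exp(-eta)) if eta > 0 else math.log1p(math.exp(eta))

def loglik(b0, b1):
    """Binomial logit log-likelihood without the constant log C(n_i, y_i) terms."""
    return sum(y * (b0 + b1 * x) - n * log1pexp(b0 + b1 * x)
               for x, n, y in zip(xs, ns, ys))

def score_b0(b0, b1):
    """Derivative of loglik in b0: sum of (y_i - n_i * pi_i); decreasing in b0."""
    return sum(y - n / (1.0 + math.exp(-(b0 + b1 * x))) for x, n, y in zip(xs, ns, ys))

def bisect(func, a, b, tol=1e-10):
    """Root of func on [a, b]; assumes a sign change between a and b."""
    fa = func(a)
    while b - a > tol:
        mid = 0.5 * (a + b)
        fm = func(mid)
        if (fa < 0) == (fm < 0):
            a, fa = mid, fm
        else:
            b = mid
    return 0.5 * (a + b)

def profile_loglik(b1):
    """l_p(b1): maximise over the nuisance b0 by solving score_b0 = 0."""
    b0_hat = bisect(lambda b0: score_b0(b0, b1), -30.0, 30.0)
    return loglik(b0_hat, b1)

def argmax_unimodal(func, a, b, tol=1e-8):
    """Golden-section search for the maximiser of a unimodal function on [a, b]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    while b - a > tol:
        if func(c) > func(d):
            b, d = d, c
            c = b - inv_phi * (b - a)
        else:
            a, c = c, d
            d = a + inv_phi * (b - a)
    return 0.5 * (a + b)

b1_hat = argmax_unimodal(profile_loglik, -5.0, 5.0)
lp_max = profile_loglik(b1_hat)
g = lambda b1: 2.0 * (lp_max - profile_loglik(b1)) - CHI2_1_UPPER_05
ci_lower = bisect(g, -5.0, b1_hat)
ci_upper = bisect(g, b1_hat, 5.0)
print(f"b1_hat={b1_hat:.4f}, 95% profile CI=({ci_lower:.4f}, {ci_upper:.4f})")
```

Bisection and golden-section search are used instead of Newton-Raphson only to keep the sketch short and robust; the search ranges (-30, 30) and (-5, 5) are arbitrary choices that suit these made-up data.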




Quasi-Likelihood Method of Estimation 
Module 20 


Saurav De 


Department of Statistics 
Presidency University 




In order to construct a likelihood function, the probability distribution(s)
of the random sample observations should be specified.

But unfortunately, those probability distributions are often not available;
only limited information revealing a few aspects of the data is at hand, such as

• the domain of the sample responses;
• how the mean or median response is affected by external stimuli or
covariates;
• how the variability of the response changes with the average response;
• whether the observations are statistically independent;
• whether the response distribution under fixed covariate conditions is
symmetric or skewed.




To draw inference about the parameter even on the basis of such limited
information, a likelihood-like function is constructed; it is called the
Quasi-likelihood function. We concentrate mainly on the case where the
observations are statistically independent and where the effects of interest
can be described by a model for the expected response, say $\mu$.




Some common and relaxed assumptions:

• Let the response vector be $Y = (Y_1, \ldots, Y_n)'$. The components of the
response vector are independent with
$E(Y) = \mu = (\mu_1, \ldots, \mu_n)'$, $\mu_i = E(Y_i)$,
and $\text{Disp}(Y) = \sigma^2 V(\mu)$, where $\sigma^2$ may be known, and
$V(\mu)$ is a matrix of known functions. Obviously
$V(\mu) = \text{Diag}\{V_1(\mu), \ldots, V_n(\mu)\}$, as the components of $Y$ are
independent => $\text{Cov}(Y_i, Y_j) = 0\ \forall\ i \neq j$.

• The parameter of interest $\beta$ relates the dependence of $\mu$ to the covariate
vector $x$. Thus the symbol $\mu(\beta)$ denotes the regression function
depending on $x$.

• $\sigma^2$ : constant; at least w.r.t. $\beta$.

• $V_i(\mu) = V_i(\mu_i)$, i.e., it depends only on the $i$-th component of $\mu$ and not
on the other components

=> $V(\mu) = \text{Diag}\{V_1(\mu_1), \ldots, V_n(\mu_n)\}$.




Construction of Quasi-likelihood function:

First consider iid observations $Y_1, Y_2, \ldots, Y_n$ with common mean $\mu$ and
common variance $\sigma^2 V(\mu)$. Now consider the function
$$U(\mu; Y) = \sum_{i=1}^{n} \frac{Y_i - \mu}{\sigma^2 V(\mu)}.$$

$U$ has the following properties:
• $E[U(\mu; Y)] = 0$
• $V[U(\mu; Y)] = \frac{n}{\sigma^2 V(\mu)}$



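Let us check :
$$E[U(\mu; Y)] = \sum_{i=1}^{n} \frac{E(Y_i - \mu)}{\sigma^2 V(\mu)} = \sum_{i=1}^{n} \frac{E(Y_i) - \mu}{\sigma^2 V(\mu)} = 0 \quad (\text{as } E(Y_i) = \mu)$$
$$V[U(\mu; Y)] = \sum_{i=1}^{n} \frac{V(Y_i)}{\sigma^4 V^2(\mu)} \quad (\text{as the } Y_i\text{'s are independent})$$
$$= \frac{n \sigma^2 V(\mu)}{\sigma^4 V^2(\mu)} \quad (\text{as } V(Y_i) = \sigma^2 V(\mu) \text{ under the iid case})$$
$$= \frac{n}{\sigma^2 V(\mu)}.$$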




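And
$$\frac{\partial U}{\partial \mu} = \frac{V(\mu) \sum_{i=1}^{n} (-1) - V'(\mu) \sum_{i=1}^{n} (Y_i - \mu)}{\sigma^2 V^2(\mu)} = -\frac{n V(\mu) + V'(\mu) \sum_{i=1}^{n} (Y_i - \mu)}{\sigma^2 V^2(\mu)}$$
$$\implies -E\left[\frac{\partial U}{\partial \mu}\right] = \frac{n V(\mu) + V'(\mu) \sum_{i=1}^{n} E(Y_i - \mu)}{\sigma^2 V^2(\mu)} = \frac{n V(\mu) + V'(\mu) \times 0}{\sigma^2 V^2(\mu)} = \frac{n}{\sigma^2 V(\mu)} = V[U(\mu; Y)].$$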
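These three properties can be verified exactly in a small case. The sketch below (Python, illustrative names) assumes Bernoulli responses with mean $\mu$, so that $\sigma^2 = 1$ and $V(\mu) = \mu(1 - \mu)$, and enumerates all $2^n$ outcomes for $n = 3$ to compute $E[U]$, $V[U]$ and $-E[\partial U / \partial \mu]$ without simulation.

```python
from itertools import product

def quasi_score(ys, mu, sigma2, var_fn):
    """U(mu; y) = sum_i (y_i - mu) / (sigma^2 V(mu))."""
    return sum(y - mu for y in ys) / (sigma2 * var_fn(mu))

def quasi_score_deriv(ys, mu, sigma2, var_fn, var_fn_prime):
    """dU/dmu = -[n V(mu) + V'(mu) sum_i (y_i - mu)] / (sigma^2 V(mu)^2)."""
    n = len(ys)
    resid = sum(y - mu for y in ys)
    return -(n * var_fn(mu) + var_fn_prime(mu) * resid) / (sigma2 * var_fn(mu) ** 2)

mu, sigma2, n = 0.3, 1.0, 3          # assumed values for the check
V = lambda m: m * (1.0 - m)          # Bernoulli variance function
V_prime = lambda m: 1.0 - 2.0 * m

mean_U = var_U = mean_neg_dU = 0.0
for ys in product([0, 1], repeat=n):
    prob = 1.0
    for y in ys:
        prob *= mu if y == 1 else 1.0 - mu
    u = quasi_score(ys, mu, sigma2, V)
    mean_U += prob * u
    var_U += prob * u * u            # equals the variance because E[U] = 0
    mean_neg_dU -= prob * quasi_score_deriv(ys, mu, sigma2, V, V_prime)

target = n / (sigma2 * V(mu))
print(mean_U, var_U, mean_neg_dU, target)
```

Both the variance of $U$ and the negative expected derivative equal $n / (\sigma^2 V(\mu)) = 3/0.21$, which is the information-type identity that makes $U$ behave like a score.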




These properties are also true for the score function $\frac{\partial \ell}{\partial \mu}$ provided
• the $Y_i$'s have a distribution in the exponential family with pmf or pdf of the form
$$f_\theta(y) = \exp\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right\}$$
(Member of Exponential Dispersion Family)

where $\theta$ : Canonical parameter
$b(\theta)$ : Cumulant function
$\phi$ : Dispersion parameter




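Before verifying this fact, note that

$Y \sim f_\theta(y) \implies \mu\ (= E(Y)) = b'(\theta)$ and $\text{Var}(Y) = b''(\theta)\, a(\phi)$.
Verify:
We know $\int f_\theta(y)\, dy = 1 \implies \frac{\partial}{\partial \theta} \int f_\theta(y)\, dy = 0$

$\iff \int \frac{\partial}{\partial \theta} f_\theta(y)\, dy = 0$ (interchanging the order of integration and
differentiation)

$\iff \int \left\{\frac{\partial}{\partial \theta} \log f_\theta(y)\right\} f_\theta(y)\, dy = 0$

Or $E\left(\frac{\partial}{\partial \theta} \log f_\theta(Y)\right) = 0$ ... ... (*)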




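But here $\log f_\theta(y) = \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)$

$\implies \frac{\partial}{\partial \theta} \log f_\theta(y) = \frac{y - b'(\theta)}{a(\phi)}$

Or $E\left(\frac{\partial}{\partial \theta} \log f_\theta(Y)\right) = \frac{\mu - b'(\theta)}{a(\phi)}$ as $E(Y) = \mu$.

(*) => $\mu = b'(\theta)$.
Again $\frac{\partial}{\partial \theta} \int \left\{\frac{\partial}{\partial \theta} \log f_\theta(y)\right\} f_\theta(y)\, dy = 0$

Once again interchanging the order of integration and differentiation,
$$\int \left[\frac{\partial^2}{\partial \theta^2} \log f_\theta(y) + \left(\frac{\partial}{\partial \theta} \log f_\theta(y)\right)^2\right] f_\theta(y)\, dy = 0$$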




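$\implies E\left(\frac{\partial^2}{\partial \theta^2} \log f_\theta(Y)\right) + E\left(\frac{\partial}{\partial \theta} \log f_\theta(Y)\right)^2 = 0$ ... ... (**)

Now here $\frac{\partial^2}{\partial \theta^2} \log f_\theta(y) = -\frac{b''(\theta)}{a(\phi)}$ and $\left(\frac{\partial}{\partial \theta} \log f_\theta(y)\right)^2 = \frac{(y - b'(\theta))^2}{a^2(\phi)}$.

So (**) $\implies -\frac{b''(\theta)}{a(\phi)} + \frac{\text{Var}(Y)}{a^2(\phi)} = 0$ (as $b'(\theta) = \mu$ and $E(Y - \mu)^2 = \text{Var}(Y)$)

Or $\text{Var}(Y) = b''(\theta)\, a(\phi) = \sigma^2 V(\mu)$

[As $\theta$ is expressible as a function of $\mu$, $b''(\theta) = V(\mu)$, another function of
$\mu$.]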
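As a worked instance of these two identities, consider the Poisson distribution with mean $\mu$. This example is added here for illustration and is not on the slides; the identification of $\theta$, $b$, $a$ and $c$ is the standard one.

```latex
f(y) = \frac{e^{-\mu}\mu^{y}}{y!}
     = \exp\{\, y\log\mu - \mu - \log y! \,\}, \qquad y = 0, 1, 2, \ldots

\theta = \log\mu, \quad b(\theta) = e^{\theta}, \quad a(\phi) = 1, \quad c(y,\phi) = -\log y!

E(Y) = b'(\theta) = e^{\theta} = \mu, \qquad
\mathrm{Var}(Y) = b''(\theta)\,a(\phi) = e^{\theta} = \mu

\Rightarrow\ \sigma^{2} = 1, \quad V(\mu) = \mu .
```

So for Poisson responses the variance function is $V(\mu) = \mu$ with $\sigma^2 = 1$, matching the form $\text{Var}(Y) = \sigma^2 V(\mu)$ used throughout.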




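Now go back to verifying the fact related to the score function $\frac{\partial \ell}{\partial \mu}$.

Based on a random sample $(Y_1, \ldots, Y_n)$ of size $n$, the loglikelihood function is
$$\ell(\theta; y) = \sum_{i=1}^{n} \log f_\theta(y_i) = \sum_{i=1}^{n} \frac{y_i \theta - b(\theta)}{a(\phi)} + \sum_{i=1}^{n} c(y_i, \phi).$$

So, like (*) and (**), here also we get
• $E\left(\frac{\partial}{\partial \theta} \ell(\theta)\right) = 0$ ... ... (@)
• $E\left(\frac{\partial^2}{\partial \theta^2} \ell(\theta)\right) + E\left(\frac{\partial}{\partial \theta} \ell(\theta)\right)^2 = 0$ ... ... (@@)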



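Also $\frac{\partial^2}{\partial \theta^2} \ell(\theta) = -\frac{n\, b''(\theta)}{a(\phi)}$

and $\frac{\partial \theta}{\partial \mu} = 1 \Big/ \frac{\partial \mu}{\partial \theta} = 1 / b''(\theta)$ (as $\mu = b'(\theta)$)

Now $E\left(\frac{\partial}{\partial \mu} \ell(\theta)\right) = \frac{\partial \theta}{\partial \mu}\, E\left(\frac{\partial}{\partial \theta} \ell(\theta)\right) = 0$ (from (@)) ... ... (•)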




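$$\frac{\partial^2}{\partial \mu^2} \ell(\theta) = \frac{\partial}{\partial \mu} \left(\frac{\partial \ell(\theta)}{\partial \theta} \cdot \frac{\partial \theta}{\partial \mu}\right) = \frac{\partial^2 \theta}{\partial \mu^2} \cdot \frac{\partial \ell(\theta)}{\partial \theta} + \left(\frac{\partial \theta}{\partial \mu}\right)^2 \frac{\partial^2 \ell(\theta)}{\partial \theta^2}$$
$$= \frac{\partial^2 \theta}{\partial \mu^2} \cdot \frac{\partial \ell(\theta)}{\partial \theta} - \frac{1}{b''(\theta)} \cdot \frac{1}{b''(\theta)} \cdot \frac{n\, b''(\theta)}{a(\phi)}$$
$$\implies E\left(-\frac{\partial^2}{\partial \mu^2} \ell(\theta)\right) = \frac{n}{b''(\theta)\, a(\phi)}, \quad \text{since } E\left(\frac{\partial}{\partial \theta} \ell(\theta)\right) = 0 \text{ by (@)}.$$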




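$$\implies E\left(-\frac{\partial^2}{\partial \mu^2} \ell(\theta)\right) = \frac{n}{a(\phi)\, b''(\theta)} = \frac{n}{\text{Var}(Y)} = \frac{n}{\sigma^2 V(\mu)} \quad \ldots\ \ldots\ (••)$$
Now
$$\text{Var}\left(\frac{\partial}{\partial \mu} \ell(\theta)\right) = E\left(\frac{\partial}{\partial \mu} \ell(\theta)\right)^2 = \left(\frac{\partial \theta}{\partial \mu}\right)^2 E\left(\frac{\partial}{\partial \theta} \ell(\theta)\right)^2.$$
But from (@@), $E\left(\frac{\partial}{\partial \theta} \ell(\theta)\right)^2 = -E\left(\frac{\partial^2}{\partial \theta^2} \ell(\theta)\right) = \frac{n\, b''(\theta)}{a(\phi)}$
$$\implies \text{Var}\left(\frac{\partial}{\partial \mu} \ell(\theta)\right) = \frac{1}{(b''(\theta))^2} \cdot \frac{n\, b''(\theta)}{a(\phi)} = \frac{n}{a(\phi)\, b''(\theta)} = \frac{n}{\sigma^2 V(\mu)} \quad \ldots\ \ldots\ (•••)$$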



(•), (••) and (•••) ensure that $U(\mu; Y)$, defined earlier, has the same
characteristic features as the score function $\frac{\partial \ell}{\partial \mu}$ has in the case of
a member of the exponential (-dispersion) family.

In other words, $U(\mu; Y)$ can play the role of a score function even when the
knowledge about the probability law of the random sample observations is
nil or too insufficient to construct the original score function $\frac{\partial \ell}{\partial \mu}$.




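Now define
$$Q(\mu; y) = \sum_{i=1}^{n} \int_{y_i}^{\mu} \frac{y_i - t}{\sigma^2 V(t)}\, dt.$$

Observe that
$$\frac{\partial Q}{\partial \mu} = \sum_{i=1}^{n} \frac{y_i - \mu}{\sigma^2 V(\mu)} = U(\mu; Y)$$
=> $Q(\mu; Y)$ behaves like a loglikelihood function
(where $U$ behaves like $\frac{\partial \ell}{\partial \mu}$, the derivative of the log-likelihood function).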




We call $U$ the quasi-score function and $Q$ the quasi-likelihood function or,
more precisely, the log quasi-likelihood function (because it resembles
the loglikelihood function).

These two are elegant instruments for carrying out the likelihood
method when sufficient information regarding the population
probability model is lacking.

The following illustration can help for better understanding.


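Illustration. Let $\sigma = 1$ and $V(\mu) = \mu(1 - \mu)$, $0 < \mu < 1$. Then
$$Q(\mu; y) = \sum_{i=1}^{n} \int_{y_i}^{\mu} \frac{y_i - t}{t(1 - t)}\, dt = \sum_{i=1}^{n} \left[y_i \int_{y_i}^{\mu} \frac{dt}{t} - (1 - y_i) \int_{y_i}^{\mu} \frac{dt}{1 - t}\right]$$
$$= \sum_{i=1}^{n} \left[y_i \ln\left(\frac{\mu}{1 - \mu}\right) + \ln(1 - \mu) - y_i \ln\left(\frac{y_i}{1 - y_i}\right) - \ln(1 - y_i)\right].$$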

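Similarly if $V(\mu) = k\mu$, $k > 1$ being known,
$$Q(\mu; y) = \sum_{i=1}^{n} \int_{y_i}^{\mu} \frac{y_i - t}{kt}\, dt = \sum_{i=1}^{n} \left[\frac{y_i}{k} \int_{y_i}^{\mu} \frac{dt}{t} - \frac{1}{k} \int_{y_i}^{\mu} dt\right]$$
$$= \frac{\ln \mu}{k} \sum_{i=1}^{n} y_i - \frac{n\mu}{k} - \frac{1}{k} \sum_{i=1}^{n} y_i \ln y_i + \frac{1}{k} \sum_{i=1}^{n} y_i.$$

In both the examples our job is to find that choice of $\mu$ which
maximises $Q(\mu; y)$. In this context a numerical method of solution can be a
great help for us.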
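A crude numerical maximisation of the two quasi-likelihoods can be sketched as follows (Python, made-up data; the grid search and the function names are assumptions for illustration). In this iid common-mean setting the quasi-score equation $U(\mu; y) = 0$ reduces to $\sum (y_i - \mu) = 0$, so the grid maximiser should land on the sample mean in both cases.

```python
import math

def quasi_loglik_binomial_type(mu, ys):
    """Q(mu; y) for sigma = 1, V(mu) = mu (1 - mu), with 0 < mu < 1 and y_i in [0, 1]."""
    q = 0.0
    for y in ys:
        # Integral of (y - t) / (t (1 - t)) from y to mu; 0 * log(0) is taken as 0.
        if y > 0:
            q += y * math.log(mu / y)
        if y < 1:
            q += (1 - y) * math.log((1 - mu) / (1 - y))
    return q

def quasi_loglik_linear_variance(mu, ys, k):
    """Q(mu; y) for V(mu) = k * mu, with mu > 0 and y_i > 0."""
    return sum(y * math.log(mu / y) - (mu - y) for y in ys) / k

def grid_argmax(func, lo, hi, steps=20000):
    """Crude grid search; returns the grid point with the largest func value."""
    best_mu, best_val = lo, func(lo)
    for j in range(1, steps + 1):
        mu = lo + (hi - lo) * j / steps
        val = func(mu)
        if val > best_val:
            best_mu, best_val = mu, val
    return best_mu

binary = [1, 0, 0, 1, 1, 0, 1, 1]   # made-up 0/1 responses
counts = [3, 7, 4, 6, 5]            # made-up positive responses
mu_hat_binary = grid_argmax(lambda m: quasi_loglik_binomial_type(m, binary), 0.001, 0.999)
mu_hat_counts = grid_argmax(lambda m: quasi_loglik_linear_variance(m, counts, k=2.0), 0.01, 20.0)
print(mu_hat_binary, sum(binary) / len(binary))
print(mu_hat_counts, sum(counts) / len(counts))
```

Note that the constant $k$ rescales $Q$ but does not move its maximiser; numerical methods become necessary once $\mu$ depends on covariates through a regression parameter.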




Overdispersion and Quasi-likelihood :

First we discuss the problem of 'Overdispersion'.

According to the nature of the response variable, we generally assume or fix
a suitable probability distribution for it. The variance (provided it exists)
of this distribution is called the Nominal Variance.

Now, if due to some additional structure in the response variable, such as
classification or clustering, the variance of the response exceeds its
nominal variance, the situation is termed Overdispersion.

In reality overdispersion is not a rare or uncommon issue. In fact it will
not be an exaggeration to say that overdispersion is the rule in practice
and having the nominal variance is the exception.

The degree of overdispersion depends on the field of application.




The problem of overdispersion arises where the nominal variance is a genuine function of the mean, i.e. where the response distribution belongs to the exponential dispersion family; this is the family that leads to generalised linear regression models in the presence of suitable covariate(s).




In practice, binomial, Poisson, negative binomial and gamma type response variables most often lead to this problem, although any response whose distribution belongs to the aforesaid family may exhibit overdispersion.


Below we discuss the overdispersion problem for binomial type responses.

Suppose we consider $m$ binary responses (taking the value 0 or 1, corresponding to the outcomes 'failure' and 'success') obtained from a population having $r$ (fixed) clusters. Families, litters and colonies are very common examples of naturally occurring clusters.




For simplicity, let $k = m/r$ be the common cluster size (in general it could vary).

Let $\pi_i$ denote the success probability in the $i$th cluster, $0 < \pi_i < 1$; it varies from cluster to cluster.

Let $Z_i$ = number of successes in the $i$th cluster, so that $Z_i \sim \mathrm{Bin}(k, \pi_i)$ independently.

Hence the total number of successes out of the $m$ Bernoulli outcomes is

$$
Y = Z_1 + Z_2 + \cdots + Z_r .
$$

As the $\pi_i$'s also vary, we can assume a probability distribution (prior distribution) for the $\pi_i$'s such that $E(\pi_i) = \pi$, say.

Now it is the most common experience that $V(\pi_i) = c\,\pi(1-\pi)$, $0 < c < 1$. Why?




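The most common prior for a Bernoulli success probability is the Beta prior. So, without loss of generality, suppose

$$
\pi_i \overset{\text{iid}}{\sim} \mathrm{Beta}(a, b), \quad a, b > 0
\;\Longrightarrow\;
E(\pi_i) = \frac{a}{a+b} \quad \text{and} \quad V(\pi_i) = \frac{ab}{(a+b)^2 (a+b+1)} .
$$

Thus $\pi = \dfrac{a}{a+b}$ and $\dfrac{b}{a+b} = 1 - \pi$. Hence $V(\pi_i) = \dfrac{\pi(1-\pi)}{a+b+1} = c\,\pi(1-\pi)$ with $c = \dfrac{1}{a+b+1}$, so that $0 < c < 1$.

Now

$$
E(Y) = E\,E(Y \mid \pi_i\text{'s}) = E \sum_{i=1}^{r} k\pi_i = rk\pi = m\pi
$$

and

$$
V(Y) = E\,V(Y \mid \pi_i\text{'s}) + V\,E(Y \mid \pi_i\text{'s})
= E \sum_{i=1}^{r} k\pi_i(1-\pi_i) + V \sum_{i=1}^{r} k\pi_i
= \sigma^2\, m\pi(1-\pi) \quad \text{(on simplification)},
$$

where $\sigma^2 = 1 + (k-1)c \ge 1$, with equality iff $k = 1$.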
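The step marked "on simplification" can be spelled out as follows. The sketch uses only facts already stated: $E(\pi_i) = \pi$, $V(\pi_i) = c\,\pi(1-\pi)$, independence of the $\pi_i$'s across clusters, and $m = rk$.

```latex
\begin{align*}
E \sum_{i=1}^{r} k\pi_i(1-\pi_i)
  &= rk\,\bigl[E(\pi_i) - E(\pi_i^2)\bigr]
   = rk\,\bigl[\pi - \pi^2 - c\,\pi(1-\pi)\bigr]
   = rk\,(1-c)\,\pi(1-\pi), \\
V \sum_{i=1}^{r} k\pi_i
  &= r k^2\, V(\pi_i)
   = r k^2 c\,\pi(1-\pi), \\
V(Y)
  &= rk\,\pi(1-\pi)\,\bigl[(1-c) + kc\bigr]
   = m\pi(1-\pi)\,\bigl[1 + (k-1)c\bigr].
\end{align*}
```

The first line uses $E(\pi_i^2) = V(\pi_i) + \pi^2$; the second uses independence, so that the variance of the sum is the sum of the variances.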




Thus, for $k > 1$, $V(Y) > m\pi(1-\pi)$.

But without knowledge of the clustering we would simply take $Y \sim \mathrm{Bin}(m, \pi)$, which implies $V(Y) = m\pi(1-\pi)$, the nominal variance.

Hence, for $k > 1$, binary response data from a population made up of clusters of size $k$ exhibit overdispersion.
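The variance inflation derived above can be checked by simulation. The sketch below draws cluster success probabilities from a Beta(a, b) prior, generates the cluster counts, and compares the simulated variance of Y with the nominal variance mπ(1 − π) and with the inflated value σ²mπ(1 − π). The settings r = 50, k = 8, a = 2, b = 3, the seed, the number of replications and the function names are assumptions chosen purely for illustration.

```python
import random
import statistics


def overdispersion_factor(k, a, b):
    """sigma^2 = 1 + (k - 1) * c, with c = 1 / (a + b + 1) under a Beta(a, b) prior."""
    c = 1.0 / (a + b + 1.0)
    return 1.0 + (k - 1) * c


def simulate_total_successes(r, k, a, b, rng):
    """Draw Y = Z_1 + ... + Z_r, where Z_i | pi_i ~ Bin(k, pi_i) and pi_i ~ Beta(a, b)."""
    total = 0
    for _ in range(r):
        pi_i = rng.betavariate(a, b)
        # A Bin(k, pi_i) draw built as a sum of k Bernoulli(pi_i) trials.
        total += sum(1 for _ in range(k) if rng.random() < pi_i)
    return total


# Illustrative (assumed) settings.
r, k, a, b = 50, 8, 2.0, 3.0
m = r * k                       # total number of binary responses
pi = a / (a + b)                # common prior mean of the pi_i's
nominal_var = m * pi * (1 - pi)
theoretical_var = overdispersion_factor(k, a, b) * nominal_var

rng = random.Random(12345)
draws = [simulate_total_successes(r, k, a, b, rng) for _ in range(3000)]
simulated_var = statistics.variance(draws)

print(f"nominal variance     : {nominal_var:.1f}")
print(f"theoretical variance : {theoretical_var:.1f}")
print(f"simulated variance   : {simulated_var:.1f}")
```

With these settings the simulated variance should land close to the theoretical value and well above the nominal one; setting k = 1 makes the factor equal to 1, matching the condition for equality in the derivation.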


The probability distribution of the response $Y$ will generally not have a closed-form pmf. We can nevertheless form the quasi-likelihood function of $\pi$ here and obtain its estimate as before.

In this way the overdispersed case is a natural setting in which the quasi-likelihood method of estimation can be applied successfully.


Similarly, overdispersion can creep into Poisson type responses such as count data. For further study, readers are encouraged to go through the discussions of overdispersion in several chapters of Generalized Linear Models (Second Edition) by McCullagh and Nelder.




